Coding Horror and blogs.stackoverflow.com experience "100% Data Loss" (codinghorror.com)
223 points by gfunk911 2485 days ago | 165 comments



To be fair, Atwood thought his hosting provider (CrystalTech) was backing up his system. As it turns out, their entire VM backup solution failed silently, so everyone thought the backups were being made.

If anything, I'd say this is a sign not to work with CrystalTech.


Also, to be fair, the backup process Atwood uses for Stack Overflow proper (et al.) appears to be much more robust (regular offsite backups, etc.).

The failures in this case appear to be:

1. Not taking the robustness of his blogs seriously enough (a LOT of people make this mistake, especially with their own content).

2. Being overly trusting of the procedures of his hosting provider. He thought he could trust his hosting provider's "backups" since they are a big company with lots of customers and he paid money for said backups; it turns out he got ripped off.

3. Forgetting the maxim that you have to own your core competencies, and existence should always be a core competency.


It doesn't seem to be particularly fair to the hosting provider, really, to put their name on the failpage and to tweet about how it's half their fault. Especially after you've advocated redundant backups, implied that you have them, written about the advantages of hosting your images on S3. When a mishap reveals that you've actually done none of these things, it is more than a little disingenuous to try to emphasize your hosting provider's relatively minor role in making you look foolish.


Relatively minor?! They failed to do what they said they would. They destroyed data and borked up the backups.

If it wasn't Jeff Atwood but me, would it be more their fault? Or would I just be less disingenuous?

Nobody's calling them names, or threatening to take business elsewhere or anything. But it's good to get sunlight in there, show them consequences, when they fail you. When a company providing you with service fails, you're allowed to scream it from the rooftops if you feel like it.


Their role in making him look foolish (what I said, incidentally) is indeed quite minor. If he'd actually done the things he advocated and, frankly, implied that he'd done, he'd still have his images, his downtime would have been close to zero and he'd probably get to write a triumphant article about the value of following his own sage advice.

Someone else pointed out in another comment, there's a good analogy to be drawn to Atwood's own words here: http://www.codinghorror.com/blog/archives/001079.html


"Their role in making him look foolish (what I said, incidentally) is indeed quite minor."

Fair enough. But he's not complaining about looking foolish (nor is grandparent as far as I can see). He's complaining about losing data.

In that link he isn't saying "it's your fault no matter what the world throws at you, suck it up." He's saying, "look harder at yourself before you decide somebody else is to blame." Not an issue here; can we agree it is their fault?


for the record, I blame 50% hosting provider, 50% myself

Is his tweet. Sounds like blaming to me. I suppose it's a matter of personal taste but if you write about how you're serving your images off S3 and how everyone should be making redundant backups and then get bitten by the fact you somehow wrote about these things but didn't actually do them, you're not looking very hard at yourself.

My simple point is - if you're an advice-giver who's been shown to not follow his own advice on backups, don't compound it by being an advice giver who doesn't follow his own advice on humility.


"Sounds like blaming to me."

I thought we were arguing about whether he cares about looking foolish.

Of course he's blaming them. There's reason to blame them. Regardless of what he did or didn't do, they screwed up. In fact I think he's being very restrained and humble taking 50% of the blame. I'd be fucking pissed if that happened to me. Even if I restored from backups in 30 minutes (extremely optimistic).

"if you're an advice-giver who's been shown to not follow his own advice on backups, don't compound it by being an advice giver who doesn't follow his own advice on humility."

Again, the humility he was referring to was to assume first that it was your fault until you determine otherwise for sure. What's that got to do with them screwing up?

You're basically saying: if I act holier-than-thou and advocate backups, then Slicehost or Amazon can give me crappy service because I'm not allowed to 'blame' them or mention how they suck. Which is really weird.

(I feel for the guy. I'm sitting here yakking because I can't connect to my EC2 slice for the last n hours.)


Heh, we are absolutely going to validate the 'comment value drops exponentially with nesting' rule.

I didn't say he deserves crappy service or that his provider didn't screw up by firing his VM into Neptune. What I'm saying is (in 17 different ways, at this point) - if you are a vocal advocate but don't actually practice or believe what you preach, you're not really an advocate or useful commentator but a hack. When circumstances (that incidentally and in part happen to be someone else's fault) expose you as somewhat of a hack, it's a little uncouth to be pointing fingers at others for their much more minor screw-ups. Being a hack is a bigger screw-up than accidentally (or through negligence) firing someone's VM into Neptune, that's all.


I'd argue that the general idea expressed in the post below applies as much to system administration as it does to programming. Rule #1: Take responsibility.

http://74.125.93.132/search?q=cache:xGcmkwq9zUkJ:www.codingh...


> If anything, I'd say this is a sign not to work with CrystalTech.

After a mistake like this, it may be a sign that they will never mess up again.


I disagree - it's a sign that they will never mess up their VM backup system again.

It's also a sign that they lack experience and competence. So while your odds of suffering this problem are significantly lower, your odds of the infinite number of other problems are still troubling.


We actually had our app hosted at CrystalTech until a couple months ago. Earlier this year, they had a critical power outage, which took our site offline. That's one of the reasons we're not there anymore.


Well, I'm a bit more cynical than you. However, Jack in the Box is probably one of the least likely chains to experience an E. Coli O157:H7 outbreak (since 1993, at least).


If you fail to audit and test your disaster recovery procedures on a regular basis, then you fail at competently maintaining your infrastructure. No excuses.


I'm not disagreeing, I'm just saying that before people get drunk on schadenfreude they should put themselves in Atwood's position. If you pay someone for a service you generally expect it to work when you need it.


Yeah, but if you pay for a mission critical service and never bother testing it, you've pretty much passively decided to fail.


Joining the chorus of other people saying "WRONG".

Reason 1: I pay for my bank account one way or another. It's the bank's responsibility to keep the system running and secure, not mine. Sometimes we just can't do everything ourselves and have to rely on others.

Reason 2: Many of us host something somewhere. How many do backups? How many of us check the backups? How many do check the backups checking process? (enter infinite recursion) You have to stop at some stage. You probably don't have enough time, or your time is not worth enough to check that.

He can recover most of his blog from the caches, because it was quite popular. I bet he'll be back with an almost complete archive in less than a week.


You have the perfect right to just point and say "you were supposed to do that", while losing all of your data.

The rest of us will be quite happy to ensure that what we think is happening, actually is.

It's akin to a pedestrian getting hit by a car and then saying "I had the right of way". Yeah, maybe so, but now you're in the hospital breathing through a tube.

Those of us who checked for traffic before crossing the street are at home watching TV.


The bank analogy is interesting and a good point, but the reality of ISPs is that they are less reliable, and less regulated, than banks.

If my income is dependent on data, I make sure it gets backed up. If it's an active project, I use my backups to build my dev environments so it's fairly obvious when a backup has failed.


You have good habits, and that's sincerely commendable.

However, it really isn't fair to blame this kind of problem on the end user.

Are website developers also expected to keep the servers secure? If Apache isn't patched and up to date, is that the website admin's fault?

Service providers are paid to do a job. This one failed tremendously, and should lose a large chunk of business for it. I think Atwood is being far too kind in accepting 50% of the responsibility for the data loss.

To put it another way: the only reason to maintain your own backups of your site data -- aside from healthy paranoia -- is because you expect your service provider to fail at doing their job. And if that's the case, shouldn't you be finding a service provider that does it better?


> To put it another way: the only reason to maintain your own backups of your site data -- aside from healthy paranoia -- is because you expect your service provider to fail at doing their job. And if that's the case, shouldn't you be finding a service provider that does it better?

If you don't expect your service provider to fail, you don't know anything about service providers.

Everybody fails.

Our colocation facility has redundant generator systems. They're tested regularly, and have handled failures previously. Yet, when the power went out, three of the backup generators failed, and our site (as well as Craigslist, Yelp, and others) was out for 45 minutes.

The cause? A bug in the backup generator's software: http://365main.com/status_update.html

Shit happens. Sometimes it's not your fault. You still need to prepare for it.


>Are website developers also expected to keep the servers secure? If Apache isn't patched and up to date, is that the website admin's fault?

Getting hacked isn't as potentially catastrophic as not having backups. With backups, being hacked can be recovered from and the service provider changed.

>To put it another way: the only reason to maintain your own backups of your site data -- aside from healthy paranoia -- is because you expect your service provider to fail at doing their job.

A business doesn't expect its premises to burn down, but most have fire insurance in the event that this happens. Even if I don't expect my service provider to fail, there's no way to know that they won't and it makes sense to deal with this risk if the cost of dealing with it is reasonable and the cost of not dealing with it is catastrophic.


> It's the bank's responsibility to keep the system running and secure, not mine.

As others have pointed out, it is your responsibility to check your bank statements to ensure that all is secure. But also, with banks you're dealing with money that is easily replaceable. If your card is compromised that money is gone but if it's their fault (or criminal) the bank will replace that money. Your data is not replaceable -- when it's gone, it's gone.

"Many of us host something somewhere. How many do backups? How many of us check the backups? How many do check the backups checking process?"

True, you do have to stop somewhere -- but there is due diligence and there's negligence. Making sure you are making offsite backups and periodically testing them is due diligence. On the other hand, completely relying on someone else to back up your critical data is negligence. If your data is at all important to you, then you need to have a copy. But at some point you have to accept that you've done enough.


> Reason 1: I pay for my bank account one way or another. It's the bank's responsibility to keep the system running and secure, not mine. Sometimes we just can't do everything ourselves and have to rely on others.

You are responsible for reading your statement and ensuring that all activity is valid, much in the same way that you are responsible for ensuring the viability of your recovery strategy.

> Reason 2: Many of us host something somewhere. How many do backups?

Anyone who wants to keep their data badly enough to pay (time, money) to do so. Coding Horror is a popular technology blog, and there's a significant cost in traffic and credibility when it fails.

Jeff Atwood lacked cognizance of the risks and made an incredibly poor technical and business decision in failing to validate correctness of his backups and implement a suitable recovery strategy.


You cut a lot from reason 2. That was an important part. I've seen a system where the backups were made. They were verified too. Only after an actual data loss was it discovered that the backup verification was faulty and half of the "verified" and "properly backed up" data was missing.

My point is that you can spend lots of hours trying to backup your data and verify its correctness. But unless you put it back into the actual working environment and check every single bit of it, you cannot be sure it was a proper backup - not with 1GB of information and certainly not with 1TB. And then after you verified that you can verify that you have backups, something will fail and a bunch of people will say "if you didn't check the backups properly, it's your fault". I've seen backup systems fail in amazing ways and will probably never again believe that you can be "sure".
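
For what it's worth, the "put it back and check every single bit" step can at least be scripted. Here is a minimal sketch in Python, assuming the backup is a gzip-compressed tar whose member paths are relative to the site root. The names `relative_files` and `restore_matches` are made up for this example and do not come from any tool mentioned in the thread. It proves nothing about the parts of the system it never looks at (databases, config outside the tree), which is rather the point being made above.

```python
import filecmp
import os
import tarfile
import tempfile


def relative_files(top):
    """Return the set of file paths under `top`, relative to `top`."""
    found = set()
    for root, _dirs, files in os.walk(top):
        for name in files:
            found.add(os.path.relpath(os.path.join(root, name), top))
    return found


def restore_matches(archive_path, live_dir):
    """Restore the archive into a scratch directory and compare it to live_dir.

    Returns True only if both trees contain the same files and every
    file is byte-for-byte identical.
    """
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive_path, "r:gz") as archive:
            archive.extractall(scratch)
        live = relative_files(live_dir)
        restored = relative_files(scratch)
        if live != restored:
            return False
        # shallow=False forces a content comparison, not just size and mtime.
        return all(
            filecmp.cmp(
                os.path.join(live_dir, rel),
                os.path.join(scratch, rel),
                shallow=False,
            )
            for rel in live
        )
```

On a busy site the live tree changes between backup and check, so in practice you would compare against a snapshot taken at backup time rather than the running site.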


It's interesting that you should mention banks. Banks routinely go under, and in almost all cases your deposits are only 'safe' up to a certain amount. Most data stores are worth well in excess of that amount, so if you could not reasonably expect a bank to hold on to a very large amount of data it stands to reason that you should not trust some service provider with data that has serious value without a secondary system in place.

After all, their liability will almost certainly be less than the value of your data, in which case there are two reasons to keep extra copies, both business continuity reasons and direct economic ones.


1. Banks are more reliable than hosting providers.

2. For a small personal site you don't need anything fancy, just make sure you've SFTP'd the files to your disk and snapshotted the DB once every couple months. My personal site is backed up by Rackspace cloud, but I still do an occasional rsync just in case they fuck up.


Banks make a bad analogy.

Sure, it should be the bank's responsibility to keep accounts safe. However, the fact that bank accounts are considered to be "safe" is in a large part due to systemic factors like the existence of the Fed (as lender of last resort), FDIC, general political climate of "too big to fail", etc. As mentioned above, banks actually do go under a lot; it's just not as noticeable to the clients as before [the central banks], but the cost is there. (And there's a whole discourse there of whether it's a good idea for the monetary system to have this kind of environment in the long run.)

In any case, nothing of the sort applies to the hosting providers (or even can be applied, as insuring unique data is not the same as insuring amounts of money) -- so assuming that keeping data there is as safe as keeping money in the bank is not a viable comparison.


I think there are analogies for both sides of the argument. In the case of bank accounts (or bank security) I don't test it by trying to break in, check how it works with id theft etc ;)

But for backups, at least in this (early) era of managed hosting/cloud, it's probably not too onerous to test your plans etc...


Is codinghorror really mission critical? He will restore from some other backups. Downtime on a blog is not that big a deal other than people like to hate on Jeff Atwood. Now if he lost customer credit card records or something like that, then it would be warranted.


Doesn't Jeff derive a significant portion of his personal income from the site? If so, I'd say mission critical is a good way to describe it.


If codinghorror somehow allows Jeff to live without working then I would say it is pretty mission critical. Mission critical doesn't mean the Earth will blow up or something like that.


The trick is to simulate a data loss before it happens.


The same could be said for CrystalTech


I don't think anyone's saying that CrystalTech didn't also fail.


Hmm,

It's a lesson on what can and can't be "just bought".

Security and backups both need some top-down, hands-on involvement. Recent US experience suggests dikes and air defenses are important too.


Do you also test that your airbags work properly? I don't audit and test the safety measures on most of the equipment in my house: that's what I pay the supplier for and there'll be hell to pay when they screw it up. Similarly, I don't audit and test my disaster recovery procedures, because that is what I pay my hosting partner for.


"Reasoning by analogy"

The people who made your car were legally required to test your model of car by crashing it into another one and making sure the airbags work. This is also too expensive to test yourself. One does not modify an airbag system at all without retesting it.

Your hosting provider has no legal requirement to test their backup system. At best, they have a contractual obligation. And they don't know how to test recovering your site, because testing whether it works is a different process for each site. Additionally, it's cheap to test backup procedures. Most people have a spare computer somewhere (maybe not in the data center), and it should only take a few hours to restore a copy of your site. Once a year. For Chris's sake, you could probably do it in the background and put a movie on.


When you test airbags, you have to then replace them. This is possibly the worst car analogy ever :-)


Yes, but I'm not sure if your comment is targeted at CrystalTech or codinghorror?

You can't outsource liability.

If YOU are not testing YOUR backups, YOU fail.


It's a blog. How much time and money is worth spending on this?

(I back up my Slicehost account with their official service. If they die and my backup dies with it, that's fine; I don't have the time or money to do anything better.)


I'd say a lot, to Jeff. Not only is it a popular blog in its own right that probably generates a decent income, but it's also very important to him for cross-promotion.


And do you think Jeff's blog just got less popular?


He's not serving any AdWords off that page :-P


His position has somewhat moved from looking down on coding horrors to being part of the team. Not necessarily a bad thing.


You don't need to put much time or money into it. The bare minimum of a weekly cron-automated rsync job can be done in a few minutes, even if you've never done it before.
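
To make that concrete, here is a rough sketch in Python that only builds the rsync command and the matching crontab line; it does not run anything. The `-a` (archive) and `-z` (compress) flags are standard rsync options. The host, paths, schedule, and the function names `rsync_command` and `weekly_cron_line` are placeholders invented for this example.

```python
import shlex


def rsync_command(source, destination):
    """Build an rsync invocation that copies source into destination.

    -a preserves permissions and timestamps, -z compresses in transit.
    --delete is deliberately left out so that a wiped source cannot
    wipe the backup copy on the next run.
    """
    return ["rsync", "-az", source, destination]


def weekly_cron_line(command, minute=0, hour=3, weekday=0):
    """Format a crontab entry: minute hour day-of-month month day-of-week command."""
    return f"{minute} {hour} * * {weekday} {shlex.join(command)}"


if __name__ == "__main__":
    # Hypothetical host and paths; run this on the machine that keeps the copy.
    cmd = rsync_command("backup@example.com:/var/www/", "/backups/site/")
    print(weekly_cron_line(cmd))
```

Running it from the machine that stores the backup means the copy is pulled rather than pushed, so the web server never holds credentials for the backup location.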


If I was in his shoes, yeah I'd be freaking out too and trying to get everything recovered ASAP. Then again my blog pays more than my (startup) job so that's why I care about it.


> You can't outsource liability.

So you've tested whether your toaster has proper grounding and other safety precautions, in case it short-circuits? I bet you haven't, and that's why you should stop repeating that stupid soundbite. We all 'outsource' liability all the time: we pay others to perform services for us and hold them responsible for the proper execution of those services. This includes hosting content and backing up that content.


> So you've tested whether your toaster has proper grounding and other safety precautions, in case it short-circuits?

Nope. I have a fully paid and verified home owners insurance policy though that covers any probable loss from a toaster fire.

It's sort of like having an offsite backup of my important personal stuff.

Your example is really not a parallel to a data backup. It's difficult to fully test all of the toaster's failure modes in a non-destructive way. It's easy to set up an automated backup to a remote location. Given the Gmail storage limits, you could tar and gzip your files and email your Gmail account with the backup data.

A better example in a "homeowner" realm is the flood pans that you can buy to put under your water heater or washing machine. You'd like to be able to trust that the manufacturer made a waterproof device, but at the same time it's cheap and easy to insure yourself against the most common failure modes.

The sound bite is not stupid. People who believe you can outsource all the messiness of keeping a website alive are continually bitten in the ass by EC2, Rackspace, etc. failures.


An insurance policy is equivalent to an offsite backup? So you don't have any personal items of any intrinsic value at all; you could lose it all, get a cheque in return, and be happy? Wow. Well, awesome, but I doubt that is common.

And look, I agree with you in many ways - people need to take responsibility for their own backups, sure. But there is a division of responsibility. I mean, even if all my backups are 100% perfect, I am still trusting the hosting provider to, well, keep the power on. Pay the peering bill. Keep the server temp down. Not go bankrupt tomorrow.

You can just follow this chain as far as you want. Whether you like it or not, you're utterly dependent on the DNS root server admins. There is nothing at all you can do to prepare yourself for their failure. I bet you can't generate your own electricity or grow your own food, either.

All of civilisation is built on co-dependency and delegation of responsibility. It's the only way to do anything complex. At some point, you must delegate. Atwood should have checked - but he was paying his host to do it. That's like having an employee whose job it is to do backups. At some point, you just have to let go and trust them. Otherwise you can never really do anything; you're caught up in checking minutiae; I can give examples of this kind of leadership failure until my keyboard breaks.


1. Register an Amazon AWS S3 account - https://aws-portal.amazon.com/gp/aws/developer/registration/...

2. Download my S3 backup script (or anyone's S3 Backup script) - http://github.com/leftnode/S3-Backup

3. Set up a cron job to push hourly/daily/whatever tarballs of your vhost's directory to S3.

Spend, like, $10 a month. That's 30 GB of storage, 30 GB up and 30 GB down. Now, I know that may not be a lot, but I doubt codinghorror.com had that much data.
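
The archive half of step 3 might look roughly like this in Python. This is only an illustrative sketch and not the linked script: `archive_directory` and the file naming scheme are assumptions made here, and the upload to S3 is left to whichever tool you already use.

```python
import tarfile
import time
from pathlib import Path


def archive_directory(source_dir, backup_dir, now=None):
    """Write a timestamped .tar.gz of source_dir into backup_dir.

    `now` is seconds since the epoch (defaults to the current time) and
    is only a parameter so the file name can be made predictable.
    Returns the path of the archive that was written.
    """
    source = Path(source_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime(now))
    target = Path(backup_dir) / f"{source.name}-{stamp}.tar.gz"
    target.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(target, "w:gz") as archive:
        # Store paths relative to the directory name, not the absolute path.
        archive.add(source, arcname=source.name)
    return target
```

UTC timestamps in the name keep archives sorted chronologically when listed by name, which makes pruning old ones and spotting a missing night easier.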


I do the same thing, with a script using ruby s3sync I wrote a while ago (that I should probably update):

http://paulstamatiou.com/how-to-bulletproof-server-backups-w...


I'm using your script (plus zipline and automysqlbackup) on several blogs. Thanks!

In case anyone else is interested: http://code.google.com/p/ziplinebackup/ https://sourceforge.net/projects/automysqlbackup/


s3sync is great. I'm using it for automated backups for a number of work projects. Thank you for writing it and sharing it.


Oops I should clarify: "script using ruby s3sync I wrote a while ago"

I wrote a bash script that uses s3sync (not written by me). You can thank the s3sync community for that! :-) http://s3sync.net/wiki


Just wanted to say thanks for the article -- I'm using your script for my blog :).


Thanks for s3sync!


Step 4, most importantly, should be to verify those backups on a regular basis. All too often, people lose data due to a failure and then a backup which was failing or corrupted.
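
One way to approach that verification, sketched in Python: record a SHA-256 checksum for every file at backup time, then check a test restore against that record. The function names and the idea of keeping the manifest as a plain dict are assumptions for illustration; the tools linked in this thread may do this differently or not at all.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(top):
    """Map each file's path (relative to `top`) to its SHA-256 hex digest."""
    manifest = {}
    for root, _dirs, files in os.walk(top):
        for name in files:
            full = os.path.join(root, name)
            manifest[os.path.relpath(full, top)] = sha256_of(full)
    return manifest


def verify_against(manifest, restored_top):
    """Return the sorted relative paths that are missing or differ after a restore."""
    problems = []
    for rel, expected in manifest.items():
        full = os.path.join(restored_top, rel)
        if not os.path.isfile(full) or sha256_of(full) != expected:
            problems.append(rel)
    return sorted(problems)
```

An empty list from `verify_against` only says the restored files match what was recorded; it says nothing about files that were never included in the backup in the first place.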


Good outline. An alternative, which I use, is tarsnap http://www.tarsnap.com/.


There's also tarsnap, run by HN's very own cperciva.


And Duplicity, which cperciva claims is theoretically less secure, but which seems to work pretty well for me. (Who cares if the NSA can decrypt my blog backups? They are already available unencrypted...)


> Who cares if the NSA can decrypt my blog backups

Confidentiality is only one aspect of security. Authenticity is also important in some cases: You might care if the NSA edits your backups so that after restoring them it looks like you said something you didn't really say.


If you push your backups, make sure the account used does not have permissions to modify or delete the files it is creating offsite.


Yes, this! Push backups are inherently risky -- if at all possible, backups should either be pulled by the target, or should be mediated by a third system. Otherwise, you risk an attacker deleting all your backups along with your data.


Thanks for the tip. I've been doing the same thing Atwood has been doing. So I just installed your stuff and have it running.

A tip for folks using leftnode's setup: I didn't notice $gpgRecipient hidden at the end of the config file and chased around a bit looking for it.


Thanks for using it!

I use it to back up all of my databases and entire vhosts directory every hour and night.

Let me know if you find any bugs!


I recently moved some servers over to this for encrypted daily s3 backups:

http://github.com/astrails/safe

Dead simple to specify what to backup, drop the config in /etc/safe and add a cron job.


"looks like it's 100% internet search caches for recovery. Any tips on recovering images, which typically aren't cached?"

This does not sound like the tweet of a man with backups.


"ugh, server failure at CrystalTech. And apparently their normal backup process silently fails at backing up VM images."

http://twitter.com/codinghorror/status/6573094832


It's not a backup if it's in the same location and managed by the same process.


Seriously, this. 1000 times this!

Why would anyone consider a backup on the same VM a backup at all?

All your data should be kept in two places, hopefully geographically disparate in case there's a building fire or horrible storm or what not.

You should also test it on a regular basis to make sure it exists (and so you know what to do when the shit hits the fan)


Also, if you don't regularly test backups they might as well not exist.


If you regularly spend time testing backups, you should start paying someone to do that. For instance, the guys who make the backup. Who should provide that as part of the service. Which means they are responsible for failures and can be held accountable.


egads... methinks CrystalTech will be experiencing a mass exodus.

[Also makes me wonder how to test my host's backups]


> [Also makes me wonder how to test my host's backups]

Make your own, don't rely on someone else.


Straightforward enough.

1. Get another host, preferably one a thousand miles away from your present provider.

2. Do what you need to do to get a copy of your site running there.

3. Refine/automate the process and practice it 'til you can do it with your eyes closed.


Wasn't he using Amazon S3 for images?

Cache of the relevant blog post: http://74.125.95.132/search?q=cache:yE0OdU1q6q4J:www.codingh...


Let me get this right: the guy who ran a site whose entire purpose is to make fun of other people's poor programming practices has no external backups?


I think you're thinking of Daily WTF. Jeff's blog is/was just a programming blog.

He also frequently said he was the world's worst coder, so...


Well, he did make sure to tell us to shut up, stop what we're doing and make a back-up. Because he knows things, you see:

http://74.125.93.132/search?q=cache:2HHNAk2SB6EJ:www.codingh...


Jamie Zawinski said to shut up, stop what you're doing and make a back-up. Jeff simply spread Jamie's message to a wider audience, rather than claiming it as his own.


Amusingly, he also quoted jwz as saying "The universe tends toward maximum irony. Don't push it." This incident appears to be proof positive of that.


The site says he has a backup, right? So it sounds like he ... does have a backup.


Yes, except he doesn't. Hence his tweets and his own recent question on stackoverflow about recovering his content from internet caches.


Ouch. I wanted to give him the benefit of the doubt, but I guess he is just fucked.


Actually, it was JWZ who said that.


Well, he was right.


You're right, I often confuse Jeff's blog with Daily WTF. Jeff might be self-effacing, but at the same time he also dispenses a lot of programming and software development advice.


Conferring advice does not imply a claim of authority. If I give someone directions to a coffee shop in New York, the recipient should not infer that I am an expert on New York, traffic grids, coffee, etc.


If we are describing formal logic, then fine. (Although actually, not even fine there, really. There is a fairly powerful theory of "conversational implicature" that I studied once upon a time: http://plato.stanford.edu/entries/implicature/ It tries to tease out some of what makes conversation (which is gappy and missing a lot logically speaking) as effective as it is.)

If we are describing human conversation, then - um - I call bullshit. If you give advice, you most certainly imply (for a suitably non-formal definition of 'imply') authority, knowledge, etc. about the matter you give advice on.

Your example - or the way you spin it out - is very odd. If you give directions to some coffee shop, we should be able to assume that you know where the coffee shop is. I suppose at the very least we should be able to assume that you believe that you know where the coffee shop is. The guy gave advice on backing up - he passed along someone else's advice, and said it was essential advice. The relevant switch for your analogy would be asking him to be able to describe the code in rsync or dd or whatever. Nobody is doing that. They're simply pointing out that he didn't practice what he preached (with various degrees of unfortunate serves-him-right schadenfreude, but never mind that for the moment).

Answering a question implies a claim of authority. Offering advice on the internet (unsolicited) is the equivalent of building a freaking billboard with directions to the coffee shop.


If, on the other hand, you set up a kiosk labeled "NYC Directions" on a corner, I'd assume you knew more about getting around the city than your average bear.


Well - he is human though he can have a new site now called "backup horrors". All kidding aside - this is horrible. How many people validate their backups on a regular basis? (even once a year)


Automatic validation is done through every backup (and I have my guys do a review of the logs from the nightly cron procedures). Once a month we test our restore procedures.


This is SOP in the real world.


Yup, jokes aside. If you are not doing external backups and checking them at least once a month, you are doing it wrong. You can blame your host as much as you want; it's equally your fault.

I learned it the hard way.


As I said in another thread, if you regularly refresh your development environment from your production backups, you get them verified "for free" and your ops people get plenty of practice. You never have to run a "DR exercise" to know you can.


No it's not.

It should be. It might even be in some of the plans. But most of the time it doesn't happen.


Ditto. Also, for the stuff I don't automatically validate I run a script to check file sizes for sanity.
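
A size check along those lines could be as small as the following Python sketch. The 50% shrink threshold, the function names, and the assumption that backup file names sort chronologically (for example because they carry a timestamp) are all choices made for this example.

```python
import os


def backup_sizes(directory):
    """List (name, size_in_bytes) for regular files, sorted by name."""
    with os.scandir(directory) as entries:
        return sorted((e.name, e.stat().st_size) for e in entries if e.is_file())


def size_problems(sizes, max_shrink=0.5):
    """Flag backups that are empty or much smaller than the last good one.

    `sizes` is a list of (name, bytes), oldest first. A backup is flagged
    when it is empty, or when it is smaller than the previous non-empty
    backup by more than the `max_shrink` fraction.
    """
    problems = []
    previous = None
    for name, size in sizes:
        if size == 0:
            problems.append(f"{name}: empty")
            continue
        if previous is not None and size < previous * (1 - max_shrink):
            problems.append(f"{name}: shrank from {previous} to {size} bytes")
        previous = size
    return problems
```

This catches the silent "backup ran but wrote nothing" failure cheaply; it will not notice a backup that is the right size but full of garbage, so it complements a restore test rather than replacing one.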


I don't think that was the main purpose of his blog. I'd point you to some of his posts supporting this, but... :)


"Coding Horror experienced 100% data loss at our hosting provider, CrystalTech.

I have some backups and I'll try to get it up and running ASAP!"


Read his newer tweets - his backups were just copies on the same server; they are gone as well, and now he's resorting to using internet caches to restore his content.


There are tools to retrieve data from Internet caches, like Warrick - http://warrick.cs.odu.edu/warrick.html .


[facepalm]


Programming is not system administration. As a programmer, I learned long ago that I can pay people to keep systems up, and that they do better than me and cost less than me.

Handling a failing disk is the job of a sysadmin, not a programmer.


I've learned never to trust anyone else with your critical data. I try and avoid doing sysadmin tasks (I hate it) and I pay for it but I make sure the backups are working and at least some copy is under my direct control.

Sometimes I'm extremely shocked at how cavalier professionals are at maintaining your data and I've worked with plenty.


If you're a one-man show, you have to develop enough sysadmin skills for critical tasks such as offsite backup. I do have sympathy for Jeff, I'm certainly not laughing at him. Hope this is a strong message to the many out there not doing it right.


I want to make a snarky comment, but I just feel for the guy.


I feel for the guy too, but it really isn't the first time in the last couple of months that some service provider fails at their #1 stated goal: to keep your data safe.

And it isn't just the small ones either.

For the love of - insert your favorite deity here - please try to restore your backups, and try to do so on a regular basis. If not, all you might have is the illusion of a backup.

It is a very easy trap to fall into, and I really am happy this guy writes about it because it seems there are still people that feel that if their data is in the hands of third parties that it is safe.

That goes for your stuff on flickr, but it also goes for your google mail account and all those other 'safe' ways to store your data online. In the end you have to ask who suffers the biggest loss if your data goes to the great big back-up drive in the sky, the service provider or you. If the answer is 'you' then go and make that extra backup.


There's an added bonus to restoring backups on a regular (daily) basis: a one day old instance of the production environment is always available as a playground for training, dev or qa.


Too many of us have been there. Coincidentally I started my own backup process about 20 minutes ago before I saw this story.


I'm with ya. This sort of thing can sneak up on you.


"I had backups, mind you, but they were on the virtual machine itself" (http://twitter.com/codinghorror/status/6577510116).

Sigh. When will we ever learn?


I sent jeff tarballs from blekko's webcrawl for www.codinghorror.com, blog.stackoverflow.com, www.fakeplasticrock.com and haacked.com - about 6300 pages overall. He's got Coding Horror back up from the basic html.

Unfortunately we don't have the images, but it looks like most of the site is back up at least. It will probably be more work for him to re-integrate it into the cms though.


Here's Google's cached version of a post of his about backup strategies:

http://74.125.95.132/search?q=cache:2HHNAk2SB6EJ:www.codingh...

It ends in what is now irony:

"If backing up your data sounds like a hassle, that's because it is. Shut up. I know things. You will listen to me. Do it anyway."


The part you quoted is a quote itself, which could change the meaning for readers like me.


The second sentence was. He quoted it earlier and repeated it at the end as a clear endorsement of and agreement with the statement.


I sympathize, but I have little patience for calling out your data provider when disaster strikes. Sure, there is likely some technical fault on their end, but ultimately you are accountable for your own site.

I'm not going to imply anything negative about him for losing the site temporarily. We all have "learning experiences"...

I just hope that when he does come back, he posts an insightful analysis of how he could have done more for his own reliability, rather than point fingers at a vendor.


How is it pointing fingers if your hosting provider lost your site plus their backups of your site? They were being paid to provide a service, and they failed at it in basically the worst possible way (i.e. total data loss, rather than just downtime). Sure, he should have had his own backups, but how does the fact that he could have recovered better from his host's screw-up change the fact that they screwed up?

That's not too far from saying that it's your fault if you get hit by a car while walking across a crosswalk because you didn't jump out of the way fast enough.

They failed, they failed in a catastrophic way, and they deserve to have it made public knowledge and to lose business over it. He should do a better job backing up his own stuff, but he's right to be angry with his hosting provider and to call them out.


He is right to be angry, but I have to disagree about calling them out while the site is still down.

An analysis posted after the fact can lay out where the technical failures took place, and that is the right time to describe issues with the host.

Pointing fingers in anger is very frowned upon in all organizations I work with. It implies a lack of ownership of your own products and a lack of maturity in handling your business affairs. Organizations that pursue blame before solutions do not have a positive culture -- they have a fear-based culture.

To answer your direct question, nothing he says will change the fact that the vendor screwed up. Your analogy is correct that it is the driver's fault if someone hits you, but flawed in that you cannot absolve yourself of all responsibility for your own safety when crossing streets.

When a vendor screws up, the level of professionalism that you portray when dealing with the situation says a great deal about yourself.


Point taken regarding calling them out before the situation is resolved: it would be far more respectful to allow the provider a reasonable amount of time to correct the error before explicitly attacking them. In this situation, I can understand it a little more, as recovering the backups from various internet caches might seem like a time-critical operation, and it would be difficult to ask for help with that without some explanation of what's happened. That explanation certainly could be fairly vague, though. Once the situation is resolved, though, if the resolution is unsatisfactory I think it's fair to say so.

I really took issue with your initial comment because "pointing the finger" has some connotation of assigning blame unfairly or unreasonably; I 100% agree that companies with a culture of blame are poisonous, and that people should err on the side of accepting too much responsibility rather than too little, and do their best to not pass blame on to other people.

There's certainly a line, however, across which I think it's reasonable to call someone else out. Where that line is depends on the situation and your relationship: the bar for doing it within your team is astronomically high (you should basically always deal with those things internally), within even the same company is still incredibly high (likewise), but it's lower when it comes to vendor relationships. Where you draw the line is probably different from where I draw it.

So while I agree that he should have waited, I don't think that publicly expressing his anger with his hosting provider after the fact would count as "pointing the finger" or "passing the buck" or be otherwise indicative of a lack of personal responsibility; to me it would be understandable frustration and anger at having been so dramatically let down by a third party you were contracting with. And honestly, that sort of negative publicity is one of the strongest checks we have on companies, be they hosting companies or retail stores or any other type of establishment.

More to the point: even if he did have backups, if they really lost his data and were unable to recover it themselves, I think he'd still be justified in outing their failure publicly after the fact. But again, you're definitely right that he should have waited and given the host a chance to resolve the issue before saying anything.


It's not the sort of thing I like to read, but still, it makes me cringe in sympathy.

/me does a git pull from his weblog


The next Stack Overflow podcast should be quite interesting...


This is why I like restoring my backup slice every so often, just to make sure it's there. A couple years ago I had to use archive.org as my external backup, and it wasn't fun.



Looks like someone changed something near the end of June '08 that prevented archive.org from storing the site. Gotta suck.


they have a lag (it's supposed to be 6 months, but they're trailing that at the moment).


Yeah, I knew about the lag, but 1 1/2 years of lag seemed to signify other issues to me.


another data point here (i have been wondering about what is happening ever since this thread) - i just noticed that archive.org's bot has crawled my web site in the last month (or, at least, something identifying itself as such in the logs).


When people forget to save a document and close the program, among the "ha ha I'm so superior" replies are a few people advocating unpopular niche systems which are nice enough to autosave.

When people make mistakes and save them over the top, there among the smug laughter are a few people on the user's side reminiscing about long forgotten systems of old which have versioning filesystems by default so recovery is only a moment away.

I haven't seen any replies in this thread along those lines - everyone is putting it firmly on the system administrator or the user. Would it be so hard during an install for a program to say "and now enter an encryption password and an ssh server address where I can backup to nightly"? If it's so simple you can script it yourself in a few minutes, isn't it so simple that many/most systems should come with that themselves?

It's long past the time where "computer lost my data" "well you've only yourself to blame" should be considered an old fashioned attitude.


This has to be tough. I run a small server that includes email hosting and an image gallery site that numerous people contribute to, and I get a lot of questions or complaints when it is down (usually bad network or a brief DC power outage). I have been using SSH and rsync for a long time to pull the contents of every important directory on that server to a local Solaris server running a RAIDZ pool and time slider (so I do get incremental backup).

I didn't actually lose data like Jeff, but the datacenter the server was in decided to kill the power to the machine 2 weeks before the scheduled date (poor processes and a move to a different building) and I didn't get any notification before it happened. It took another 3 weeks for them to ship my machine back to me. Because I had nightly backups I was able to restore email and the photo site to a new Linode instance in a few hours. Without those backups I would have been hurting bad.


/runs and backs up everything.


I've backups of my silly blog antirez.com; this guy ran a business out of his blog and doesn't have external backups. The data is small, it is easy to transfer. It's almost unbelievable to me.


Did he? I thought Jeff had some "real job" in addition to blogging and Stack Overflow. I honestly doubt an occasionally-updated programming blog was putting the food on his table.


For several years he made this post every working day:

  <p>inane comment showing <b>strong naivete</b></p>
  <blockquote>copypasta</blockquote>
  <img src=macro.jpeg>
  <blockquote>copypasta</blockquote>
  <p>Confident summary with <b>conclusive statement</b>
     displaying no comprehension of the quoted text
  </p>
  <p>Buy a Visual Studio Plugin from my sponsor!</p>


I remember him blogging at some point about making more money from the blog than from his job, and deciding to blog full time.

I guess I can not link to the blog post...


> I guess I can not link to the blog post...

Well, I sort of can:

http://74.125.93.132/search?q=cache:9vLV7YkK14sJ:www.codingh...


In the Joel thread, there was mention of "Truly Great Programmers(TM)", what their qualities are and how to hire them. I don't know if folks were putting Jeff in that category or not. But as a skeptic of this True Greatness, I'm perhaps grasping at straws in saying "hey look, anyone can screw up, True Greatness, phooey".


In case you'd like to back up your own blog, I posted a four part series on how to do it with Amazon S3 and cron: http://techiferous.com/2009/11/getting-started-with-amazon-s...

This assumes you're running your blog on your own VPS.


Sounds like a good coding horror article.


Redundancy, people. All my really important data lives in at least three (and often more) of the following eight places: (1) my current MacBook Pro; (2) my legacy Linux box; (3) my external USB drive; (4) my personal (Slicehost) server; (5) Dropbox; (6) Mozy; (7) Heroku; (8) GitHub. Wow, that was even more than I expected. Did I mention redundancy?


"Only wimps use tape backup: _real_ men just upload their important stuff on ftp, and let the rest of the world mirror it ;)"

    Linus Torvalds
I guess people use Google's cache and archive.org these days :)


I think it's more of an observation about which of the two are _real_ men, and less an observation about Google.


His blog gets linked to a lot, for better or worse. That's a blow.


This link appears to no longer link to a relevant post :(...


Just my 2 cents here: I encrypt my daily sql backup before sending it in a couple of places. This way I can store it pretty much everywhere - one place is actually on a client's server. Aescrypt for encryption, ssh with expect for actual transfer. Oh, and watch out for expect timing out - it was a fun moment when backups grew beyond the default timeout and expect started truncating them. Yup, backups should be checked often.
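
A truncation like that is the sort of thing a digest comparison catches. As a rough Python sketch (the helper names `sha256_of` and `copy_is_intact` are made up for illustration, and both paths are local here; for a real remote copy the second digest would have to be computed on the receiving host and compared with the local one):

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large dump need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_is_intact(original, copy):
    """True only if both files have identical SHA-256 digests."""
    return sha256_of(original) == sha256_of(copy)
```

Comparing digests of the encrypted file before and after transfer shows that the bytes arrived unchanged. It says nothing about whether the dump itself is restorable.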


  Do not use SSH with Expect.
  Do not use them here or there.
  You will not use them anywhere.
Not just because of how much expect sucks: SSH tools leave out a --password option for a reason! Use passphrase-less RSA keys with a restricted account.


Also, because expect(1) defeats the entire "remote shell" aspect of SSH, which lets me do this from my workstation:

  ssh server tar zc /srv/http > http-backup.tar.gz
Yes, that's a remote command whose output is piped to a local file. With imagination, ssh, and shell-fu, one can get very far with backup automation and testing.
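
On the testing side, one cheap check for a tarball pulled this way is to read every member back. This is only a sketch in Python using the standard `tarfile` module; the function name `archive_is_readable` is an assumption, and a pass means the gzip stream and tar members could be read in full, not that the contents match the server.

```python
import tarfile
import zlib


def archive_is_readable(path):
    """Return True if every file member of a .tar.gz can be read in full."""
    try:
        with tarfile.open(path, "r:gz") as archive:
            for member in archive:
                if member.isfile():
                    extracted = archive.extractfile(member)
                    # Read and discard the data; a truncated or corrupt
                    # stream raises an exception somewhere in this loop.
                    while extracted.read(1 << 20):
                        pass
        return True
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        return False
```

For example, `archive_is_readable("http-backup.tar.gz")` could run right after the `ssh` command above. It will flag an archive cut off partway through, but a periodic real restore is still the stronger test.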


Maybe he should've bought Microsoft Backup Server 2010


Looks like coding horror is now back online.


Maybe he should have asked a question on serverfault.com about the best way to back up his server?


I feel for him.

But I wonder if this eventually turns out as a case study for "How to backup an entire site using the archives and search engine caches"

I know it's hard; but everything is _there preserved_ anyway. So it _is possible_


I won't link it again in this thread, but check out Warrick. It's a tool to do just that.



The question is gone, apparently.

EDIT: Back at http://superuser.com/questions/82036/recovering-a-lost-websi...



Oh The Irony

This is really bad news for any service, but seriously... codinghorror.com (shakes head)

Rule 16. If you fail in epic proportions, it may just become a winning failure.


System administration is not "coding".


That's splitting hairs. What's a backup script? What's a restore if not a unit test?


What a nightmare. I hope he has a good writeup on the story when the site is back online. I'm sure there are some great lessons to come out of all this for the rest of us. Specifically, what were his expectations of the service providers he worked with?


Seems like there's room for a new startup for easy backups for shared webhosts.


how about a google search site:http://codinghorror.com/

and then looking at cached results?


AWESOME. Less trash on the Web :D


well it's just his blogs, for a second i thought it was stackoverflow.com and the rest of his sites, which would have been funny.


I cannot imagine how horrible this must be. Oh, the horror!



