Coding Horror and blogs.stackoverflow.com experience "100% Data Loss" (codinghorror.com)
223 points by gfunk911 on Dec 11, 2009 | 165 comments



To be fair, Atwood thought his hosting provider (CrystalTech) was backing up his system. As it turns out, their entire VM backup solution failed silently, so everyone thought the backups were being made.

If anything, I'd say this is a sign not to work with CrystalTech.


Also, to be fair, the backup process Atwood uses for Stack Overflow proper (et al.) appears to be much more robust (regular offsite backups, etc.).

The failures in this case appear to be:

1. Not taking the robustness of his blogs seriously enough (a LOT of people make this mistake, especially with their own content).

2. Being overly trusting of the procedures of his hosting provider. He thought he could trust his hosting provider's "backups" since they are a big company with lots of customers and he paid money for said backups, turns out he got ripped off.

3. Forgetting the maxim that you have to own your core competencies, and existence should always be a core competency.


It doesn't seem particularly fair to the hosting provider, really, to put their name on the fail page and to tweet about how it's half their fault. Especially after you've advocated redundant backups, implied that you have them, and written about the advantages of hosting your images on S3. When a mishap reveals that you've actually done none of these things, it is more than a little disingenuous to try to emphasize your hosting provider's relatively minor role in making you look foolish.


Relatively minor?! They failed to do what they said they would. They destroyed data and borked up the backups.

If it wasn't Jeff Atwood but me, would it be more their fault? Or would I just be less disingenuous?

Nobody's calling them names, or threatening to take business elsewhere or anything. But it's good to get sunlight in there, show them consequences, when they fail you. When a company providing you with service fails, you're allowed to scream it from the rooftops if you feel like it.


Their role in making him look foolish (what I said, incidentally) is indeed quite minor. If he'd actually done the things he advocated and, frankly, implied that he'd done, he'd still have his images, his downtime would have been close to zero and he'd probably get to write a triumphant article about the value of following his own sage advice.

As someone else pointed out in another comment, there's a good analogy to be drawn to Atwood's own words here: http://www.codinghorror.com/blog/archives/001079.html


"Their role in making him look foolish (what I said, incidentally) is indeed quite minor."

Fair enough. But he's not complaining about looking foolish (nor is grandparent as far as I can see). He's complaining about losing data.

In that link he isn't saying "it's your fault no matter what the world throws at you, suck it up." He's saying, "look harder at yourself before you decide somebody else is to blame." Not an issue here; can we agree it is their fault?


for the record, I blame 50% hosting provider, 50% myself

Is his tweet. Sounds like blaming to me. I suppose it's a matter of personal taste, but if you write about how you're serving your images off S3 and how everyone should be making redundant backups, and then get bitten by the fact that you wrote about these things but didn't actually do them, you're not looking very hard at yourself.

My simple point is: if you're an advice-giver who's been shown not to follow his own advice on backups, don't compound it by being an advice-giver who doesn't follow his own advice on humility.


"Sounds like blaming to me."

I thought we were arguing about whether he cares about looking foolish.

Of course he's blaming them. There's reason to blame them. Regardless of what he did or didn't do, they screwed up. In fact I think he's being very restrained and humble taking 50% of the blame. I'd be fucking pissed if that happened to me. Even if I restored from backups in 30 minutes (extremely optimistic).

"if you're an advice-giver who's been shown to not follow his own advice on backups, don't compound it by being an advice giver who doesn't follow his own advice on humility."

Again, the humility he was referring to was to assume first that it was your fault until you determine otherwise for sure. What's that got to do with them screwing up?

You're basically saying: if I act holier-than-thou and advocate backups, then Slicehost or Amazon can give me crappy service because I'm not allowed to 'blame' them or mention how they suck. Which is really weird.

(I feel for the guy. I'm sitting here yakking because I can't connect to my EC2 slice for the last n hours.)


Heh, we are absolutely going to validate the 'comment value drops exponentially with nesting' rule.

I didn't say he deserves crappy service or that his provider didn't screw up by firing his VM into Neptune. What I'm saying is (in 17 different ways, at this point) - if you are a vocal advocate but don't actually practice or believe what you preach, you're not really an advocate or useful commentator but a hack. When circumstances (that incidentally and in part happen to be someone else's fault) expose you as somewhat of a hack, it's a little uncouth to be pointing fingers at others for their much more minor screw-ups. Being a hack is a bigger screw-up than accidentally (or through negligence) firing someone's VM into Neptune, that's all.


I'd argue that the general idea expressed in the post below applies as much to system administration as it does to programming. Rule #1: Take responsibility.

http://74.125.93.132/search?q=cache:xGcmkwq9zUkJ:www.codingh...


If anything, I'd say this is a sign not to work with CrystalTech.

After a mistake like this, it may be a sign that they will never mess up again.


I disagree - it's a sign that they will never mess up their VM backup system again.

It's also a sign that they lack experience and competence. So while your odds of suffering this problem are significantly lower, your odds of the infinite number of other problems are still troubling.


We actually had our app hosted at CrystalTech until a couple months ago. Earlier this year, they had a critical power outage, which took our site offline. That's one of the reasons we're not there anymore.


Well, I'm a bit more cynical than you. However, Jack in the Box is probably one of the least likely chains to experience an E. Coli O157:H7 outbreak (since 1993, at least).


If you fail to audit and test your disaster recovery procedures on a regular basis, then you fail at competently maintaining your infrastructure. No excuses.


I'm not disagreeing, I'm just saying that before people get drunk on schadenfreude they should put themselves in Atwood's position. If you pay someone for a service you generally expect it to work when you need it.


Yeah, but if you pay for a mission critical service and never bother testing it, you've pretty much passively decided to fail.


Joining the chorus of other people saying "WRONG".

Reason 1: I pay for my bank account one way or another. It's the bank's responsibility to keep the system running and secure, not mine. Sometimes we just can't do everything ourselves and have to rely on others.

Reason 2: Many of us host something somewhere. How many do backups? How many of us check the backups? How many check the backup-checking process? (enter infinite recursion) You have to stop at some stage. You probably don't have enough time, or your time is not worth enough to check that.

He can recover most of his blog from the caches, because it was quite popular. I bet he'll be back with an almost complete archive in less than a week.


You have the perfect right to just point and say "you were supposed to do that", while losing all of your data.

The rest of us will be quite happy to ensure that what we think is happening, actually is.

It's akin to a pedestrian getting hit by a car and then saying "I had the right of way". Yeah, maybe so, but now you're in the hospital breathing through a tube.

Those of us who checked for traffic before crossing the street are at home watching TV.


The bank analogy is interesting and a good point, but the reality of ISPs is that they are less reliable, and less regulated, than banks.

If my income is dependent on data, I make sure it gets backed up. If it's an active project, I use my backups to build my dev environments so it's fairly obvious when a backup has failed.


You have good habits, and that's sincerely commendable.

However, it really isn't fair to blame this kind of problem on the end user.

Are website developers also expected to keep the servers secure? If Apache isn't patched and up to date, is that the website admin's fault?

Service providers are paid to do a job. This one failed tremendously, and should lose a large chunk of business for it. I think Atwood is being far too kind in accepting 50% of the responsibility for the data loss.

To put it another way: the only reason to maintain your own backups of your site data -- aside from healthy paranoia -- is because you expect your service provider to fail at doing their job. And if that's the case, shouldn't you be finding a service provider that does it better?


To put it another way: the only reason to maintain your own backups of your site data -- aside from healthy paranoia -- is because you expect your service provider to fail at doing their job. And if that's the case, shouldn't you be finding a service provider that does it better?

If you don't expect your service provider to fail, you don't know anything about service providers.

Everybody fails.

Our colocation facility has redundant generator systems. They're tested regularly, and have handled failures previously. Yet, when the power went out, three of the backup generators failed, and our site (as well as Craigslist, Yelp, and others) was out for 45 minutes.

The cause? A bug in the backup generator's software: http://365main.com/status_update.html

Shit happens. Sometimes it's not your fault. You still need to prepare for it.


>Are website developers also expected to keep the servers secure? If Apache isn't patched and up to date, is that the website admin's fault?

Getting hacked isn't as potentially catastrophic as not having backups. With backups, being hacked can be recovered from and the service provider changed.

>To put it another way: the only reason to maintain your own backups of your site data -- aside from healthy paranoia -- is because you expect your service provider to fail at doing their job.

A business doesn't expect its premises to burn down, but most have fire insurance in the event that this happens. Even if I don't expect my service provider to fail, there's no way to know that they won't and it makes sense to deal with this risk if the cost of dealing with it is reasonable and the cost of not dealing with it is catastrophic.


> It's bank's responsibility to keep the system running and secure, not mine.

As others have pointed out, it is your responsibility to check your bank statements to ensure that all is secure. But also, with banks you're dealing with money, which is easily replaceable. If your card is compromised that money is gone, but if it's their fault (or a crime) the bank will replace that money. Your data is not replaceable -- when it's gone, it's gone.

"Many of us host something somewhere. How many do backups? How many of us check the backups? How many do check the backups checking process?"

True, you do have to stop somewhere -- but there is due diligence and there's negligence. Making sure you are making offsite backups and periodically testing them is due diligence. On the other hand, completely relying on someone else to back up your critical data is negligence. If your data is at all important to you, then you need to have a copy. But at some point you have to accept that you've done enough.


Reason 1: I pay for my bank account one way or another. It's bank's responsibility to keep the system running and secure, not mine. Sometimes we just can't do everything ourselves and have to rely on others.

You are responsible for reading your statement and ensuring that all activity is valid, much in the same way that you are responsible for ensuring the viability of your recovery strategy.

Reason 2: Many of us host something somewhere. How many do backups?

Anyone who wants to keep their data badly enough to pay (time, money) to do so. Coding Horror is a popular technology blog, and there's a significant cost in traffic and credibility when it fails.

Jeff Atwood lacked cognizance of the risks and made an incredibly poor technical and business decision in failing to validate correctness of his backups and implement a suitable recovery strategy.


You cut a lot from reason 2. That was the important part. I've seen a system where the backups were made. They were verified, too. Only after an actual data loss was it discovered that the backup verification was faulty and that half of the "verified" and "properly backed up" data was missing.

My point is that you can spend lots of hours trying to back up your data and verify its correctness. But unless you put it back into the actual working environment and check every single bit of it, you cannot be sure it was a proper backup - not with 1GB of information and certainly not with 1TB. And then, after you've verified that you can verify that you have backups, something will fail and a bunch of people will say "if you didn't check the backups properly, it's your fault". I've seen backup systems fail in amazing ways and will probably never again believe that you can be "sure".


It's interesting that you should mention banks. Banks routinely go under, and in almost all cases your deposits are only 'safe' up to a certain amount. Most data stores are worth well in excess of that amount, so if you could not reasonably expect a bank to hold on to a very large amount of data it stands to reason that you should not trust some service provider with data that has serious value without a secondary system in place.

After all, their liability will almost certainly be less than the value of your data, in which case there are two reasons to keep extra copies, both business continuity reasons and direct economic ones.


1. Banks are more reliable than hosting providers. 2. For a small personal site you don't need anything fancy; just make sure you've SFTP'd the files to your disk and snapshotted the DB once every couple of months. My personal site is backed up by Rackspace Cloud, but I still do an occasional rsync just in case they fuck up.
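
For anyone who wants to copy that setup, a rough sketch of the idea (untested; "myserver", "mydb" and the paths are placeholders, and it assumes a ~/.my.cnf on the server holding the DB credentials):

  # snapshot the database remotely and pull it down locally, dated
  ssh myserver "mysqldump --single-transaction mydb | gzip" > mydb-$(date +%F).sql.gz
  # mirror the web root to local disk
  rsync -az myserver:/var/www/ ~/backups/mysite/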


Banks make a bad analogy.

Sure, it should be the bank's responsibility to keep accounts safe. However, the fact that bank accounts are considered to be "safe" is in large part due to systemic factors like the existence of the Fed (as lender of last resort), FDIC, the general political climate of "too big to fail", etc. As mentioned above, banks actually do go under a lot; it's just not as noticeable to the clients as before [the central banks], but the cost is there. (And there's a whole discourse there of whether it's a good idea for the monetary system to have this kind of environment in the long run.)

In any case, nothing of the sort applies to the hosting providers (or even can be applied, as insuring unique data is not the same as insuring amounts of money) -- so assuming that keeping data there is as safe as keeping money in the bank is not a viable comparison.


I think there are analogies for both sides of the argument. In the case of bank accounts (or bank security), I don't test it by trying to break in, checking how it handles ID theft, etc. ;)

But for backups, at least in this (early) era of managed hosting/cloud, it's probably not too onerous to test your plans, etc...


Is codinghorror really mission critical? He will restore from some other backups. Downtime on a blog is not that big a deal other than people like to hate on Jeff Atwood. Now if he lost customer credit card records or something like that, then it would be warranted.


Doesn't Jeff derive a significant portion of his personal income from the site? If so, I'd say mission critical is a good way to describe it.


If codinghorror somehow allows Jeff to live without working then I would say it is pretty mission critical. Mission critical doesn't mean the Earth will blow up or something like that.


The trick is to simulate a data loss before it happens.


The same could be said for CrystalTech.


I don't think anyone's saying that CrystalTech didn't also fail.


Hmm,

It's a lesson on what can and can't be "just bought".

Security and backups both need some top-down, hands-on involvement. Also, as recent US experience shows, dikes and air defenses are important too.


Do you also test that your airbags work properly? I don't audit and test the safety measures on most of the equipment in my house: that's what I pay the supplier for and there'll be hell to pay when they screw it up. Similarly, I don't audit and test my disaster recovery procedures, because that is what I pay my hosting partner for.


"Reasoning by analogy"

The people who made your car were legally required to test your model of car by crashing it into another one and making sure the airbags work. This is also too expensive to test yourself. One does not modify an airbag system at all without retesting it.

Your hosting provider has no legal requirement to test their backup system. At best, they have a contractual obligation. And they don't know how to test recovering your site, because testing whether it works is a different process for each site. Additionally, it's cheap to test backup procedures. Most people have a spare computer somewhere (maybe not in the data center), and it should only take a few hours to restore a copy of your site. Once a year. For Chris's sake, you could probably do it in the background and put a movie on.


When you test airbags, you have to then replace them. This is possibly the worst car analogy ever :-)


Yes, but I'm not sure if your comment is targeted at CrystalTech or codinghorror?

You can't outsource liability.

If YOU are not testing YOUR backups, YOU fail.


It's a blog. How much time and money is worth spending on this?

(I backup my Slicehost account with their official service. If they die and my backup dies with it, that's fine; I don't have the time or money to do anything better.)


I'd say a lot, to Jeff: not only is it a popular blog in its own right that probably generates a decent income, it's also very important to him for cross-promotion.


And do you think Jeff's blog just got less popular?


He's not serving any AdWords off that page :-P


His position has somewhat moved from looking down on coding horrors to being part of the team. Not necessarily a bad thing.


You don't need to put much time or money into it. The bare minimum of a weekly cron-automated rsync job can be done in a few minutes, even if you've never done it before.
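
For example, something like this in the crontab (an untested sketch; "backuphost" and the paths are placeholders, and it assumes passwordless SSH keys are already set up):

  # every Sunday at 03:00, push the web root to another machine over SSH
  0 3 * * 0 rsync -az /var/www/ backuphost:backups/myblog/www/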


If I was in his shoes, yeah I'd be freaking out too and trying to get everything recovered ASAP. Then again my blog pays more than my (startup) job so that's why I care about it.


You can't outsource liability.

So you've tested whether your toaster has proper grounding and other safety precautions, in case it short-circuits? I bet you haven't, and that's why you should stop repeating that stupid soundbite. We all 'outsource' liability all the time: we pay others to perform services for us and hold them responsible for the proper execution of those services. This includes hosting content and backing up that content.


So you've tested whether your toaster has proper grounding and other safety precautions, in case it shortcircuits?

Nope. I have a fully paid and verified home owners insurance policy though that covers any probable loss from a toaster fire.

It's sort of like having an offsite backup of my important personal stuff.

Your example is really not a parallel to a data backup. It's difficult to fully test all of the toaster's failure modes in a non-destructive way. It's easy to set up an automated backup to a remote location. Given the Gmail storage limits, you could tar and gzip your files and email your Gmail account with the backup data.

A better example in a "homeowner" realm is the flood pans that you can buy to put under your water heater or washing machine. You'd like to be able to trust that the manufacturer made a waterproof device, but at the same time it's cheap and easy to insure yourself against the most common failure modes.

The sound bite is not stupid. People who believe you can outsource all the messiness of keeping a website alive are continually bitten in the ass by EC2, Rackspace, etc. failures.


An insurance policy is equivalent to an offsite backup? So you don't have any personal items of any intrinsic value at all; you could lose it all, get a cheque in return, and be happy? Wow. Well, awesome, but I doubt that is common.

And look, I agree with you in many ways - people need to take responsibility for their own backups, sure. But there is a division of responsibility. I mean, even if all my backups are 100% perfect, I am still trusting the hosting provider to, well, keep the power on. Pay the peering bill. Keep the server temp down. Not go bankrupt tomorrow.

You can just follow this chain as far as you want. Whether you like it or not, you're utterly dependent on the DNS root server admins. There is nothing at all you can do to prepare yourself for their failure. I bet you can't generate your own electricity or grow your own food, either.

All of civilisation is built on co-dependency and delegation of responsibility. It's the only way to do anything complex. At some point, you must delegate. Atwood should have checked - but he was paying his host to do it. That's like having an employee whose job it is to do backups. At some point, you just have to let go and trust them. Otherwise you can never really do anything; you're caught up in checking minutiae; I can give examples of this kind of leadership failure until my keyboard breaks.


1. Register an Amazon AWS S3 account - https://aws-portal.amazon.com/gp/aws/developer/registration/...

2. Download my S3 backup script (or anyone's S3 Backup script) - http://github.com/leftnode/S3-Backup

3. Set up a cron job to push hourly/daily/whatever tarballs of your vhost's directory to S3.

Spend, like, $10 a month. That's 30GB of storage, 30GB up and 30GB down. Now, I know that may not be a lot, but I doubt codinghorror.com had that much data.
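
If you'd rather not run a script at all, a rough crontab-only equivalent (assuming s3cmd is installed and configured; the bucket name and paths are made up, and note the escaped % that cron requires):

  # nightly at 02:00: tar up the vhosts directory and push a dated copy to S3
  0 2 * * * tar czf /tmp/vhosts-$(date +\%F).tar.gz /var/www/vhosts && s3cmd put /tmp/vhosts-$(date +\%F).tar.gz s3://my-backup-bucket/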


I do the same thing, with a script using ruby s3sync I wrote a while ago (that I should probably update):

http://paulstamatiou.com/how-to-bulletproof-server-backups-w...


I'm using your script (plus zipline and automysqlbackup) on several blogs. Thanks!

In case anyone else is interested: http://code.google.com/p/ziplinebackup/ https://sourceforge.net/projects/automysqlbackup/


s3sync is great. I'm using it for automated backups for a number of work projects. Thank you for writing it and sharing it.


Oops I should clarify: "script using ruby s3sync I wrote a while ago"

I wrote a bash script that uses s3sync (not written by me). You can thank the s3sync community for that! :-) http://s3sync.net/wiki


Just wanted to say thanks for the article -- I'm using your script for my blog :).


Thanks for s3sync!


Step 4, most importantly, should be to verify those backups on a regular basis. All too often, people lose data to a failure and only then discover that their backups had been failing or were corrupted.
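
Even a dumb automated check beats nothing. A minimal sketch (paths and the alert address are placeholders; GNU stat assumed) that at least confirms the newest archive exists, isn't suspiciously small, and is readable:

  #!/bin/sh
  # complain loudly if the newest backup is missing, tiny, or unreadable
  # (stat -c %s is GNU; on BSD use stat -f %z instead)
  latest=$(ls -t /var/backups/*.tar.gz 2>/dev/null | head -n 1)
  if [ -z "$latest" ] || [ "$(stat -c %s "$latest")" -lt 1000000 ] || ! tar tzf "$latest" > /dev/null; then
      echo "backup check FAILED for: $latest" | mail -s "backup check FAILED" admin@example.com
  fi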


Good outline. An alternative, which I use, is tarsnap: http://www.tarsnap.com/


There's also tarsnap, run by HN's very own cperciva.


And Duplicity, which cperciva claims is theoretically less secure, but which seems to work pretty well for me. (Who cares if the NSA can decrypt my blog backups? They are already available unencrypted...)


Who cares if the NSA can decrypt my blog backups

Confidentiality is only one aspect of security. Authenticity is also important in some cases: You might care if the NSA edits your backups so that after restoring them it looks like you said something you didn't really say.


If you push your backups, make sure the account used does not have permissions to modify or delete the files it is creating offsite.


Yes, this! Push backups are inherently risky -- if at all possible, backups should either be pulled by the target, or should be mediated by a third system. Otherwise, you risk an attacker deleting all your backups along with your data.
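
Concretely, a pull setup is just the same rsync run from the backup box's crontab instead of the web server's (a sketch; hostnames and paths are made up):

  # runs on the backup machine; the web server holds no credentials for this box,
  # so a compromise there can't reach over and delete these copies
  30 4 * * * rsync -az webserver:/var/www/ /srv/backups/webserver/www/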


Thanks for the tip. I've been doing the same thing Atwood has been doing. So I just installed your stuff and have it running.

A tip for folks using leftnode's setup: I didn't notice $gpgRecipient hidden at the end of the config file and chased around a bit looking for it.


Thanks for using it!

I use it to back up all of my databases and my entire vhosts directory every hour and every night.

Let me know if you find any bugs!


I recently moved some servers over to this for encrypted daily s3 backups:

http://github.com/astrails/safe

Dead simple to specify what to backup, drop the config in /etc/safe and add a cron job.


"looks like it's 100% internet search caches for recovery. Any tips on recovering images, which typically aren't cached?"

This does not sound like the tweet of a man with backups.


"ugh, server failure at CrystalTech. And apparently their normal backup process silently fails at backing up VM images."

http://twitter.com/codinghorror/status/6573094832


It's not a backup if it's in the same location and managed by the same process.


Seriously, this. 1000 times this!

Why would anyone consider a backup on the same VM a backup at all?

All your data should be kept in two places, hopefully geographically disparate in case there's a building fire or horrible storm or what not.

You should also test it on a regular basis to make sure it exists (and so you know what to do when the shit hits the fan)


Also, if you don't regularly test backups they might as well not exist.


If you regularly spend time testing backups, you should start paying someone to do that. For instance, the guys who make the backup, who should provide that as part of the service. Which means they are responsible for failures and can be held accountable.


egads... methinks CrystalTech will be experiencing a mass exodus.

[Also makes me wonder how to test my host's backups]


[Also makes me wonder how to test my host's backups]

Make your own, don't rely on someone else.


Straightforward enough. 1) Get another host, preferably one a thousand miles away from your present provider 2) Do what you need to do to get a copy of your site running there 3) Refine/automate the process and practice it 'til you can do it with your eyes closed
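
The practice in step 3 can be as unglamorous as this (a sketch; the filenames are placeholders and it assumes the tarball/DB-dump style of backup discussed elsewhere in the thread):

  # on the spare host: unpack the latest tarball somewhere harmless and load the DB dump
  mkdir -p /var/www/test-restore
  tar xzf site-backup.tar.gz -C /var/www/test-restore/
  gunzip -c mydb-latest.sql.gz | mysql mydb
  # then actually click around the restored site before calling it a success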


Wasn't he using Amazon S3 for images?

Cache of the relevant blog post: http://74.125.95.132/search?q=cache:yE0OdU1q6q4J:www.codingh...


Let me get this right: the guy who ran a site whose entire purpose is to make fun of other people's poor programming practices has no external backups?


I think you're thinking of Daily WTF. Jeff's blog is/was just a programming blog.

He also frequently said he was the world's worst coder, so...


Well, he did make sure to tell us to shut up, stop what we're doing and make a back-up. Because he knows things, you see:

http://74.125.93.132/search?q=cache:2HHNAk2SB6EJ:www.codingh...


Jamie Zawinski said to shut up, stop what you're doing and make a back-up. Jeff simply spread Jamie's message to a wider audience, rather than claiming it as his own.


Amusingly, he also quoted jwz as saying "The universe tends toward maximum irony. Don't push it." This incident appears to be proof positive of that.


The site says he has a backup, right? So it sounds like he ... does have a backup.


Yes, except he doesn't. Hence his tweets and his own recent question on stackoverflow about recovering his content from internet caches.


Ouch. I wanted to give him the benefit of the doubt, but I guess he is just fucked.


Actually, it was JWZ who said that.


Well, he was right.


You're right, I often confuse Jeff's blog with The Daily WTF. Jeff might be self-effacing, but at the same time he also dispenses a lot of programming and software development advice.


Conferring advice does not imply a claim of authority. If I give someone directions to a coffee shop in New York, the recipient should not infer that I am an expert on New York, traffic grids, coffee, etc.


If we are describing formal logic, then fine. (Although actually, not even fine there, really. There is a fairly powerful theory of "conversational implicature" that I studied once upon a time: http://plato.stanford.edu/entries/implicature/ It tries to tease out some of what makes conversation (which is gappy and missing a lot logically speaking) as effective as it is.)

If we are describing human conversation, then - um - I call bullshit. If you give advice, you most certainly imply (for a suitably non-formal definition of 'imply') authority, knowledge, etc. about the matter you give advice on.

Your example - or the way you spin it out - is very odd. If you give directions to some coffee shop, we should be able to assume that you know where the coffee shop is. I suppose at the very least we should be able to assume that you believe that you know where the coffee shop is. The guy gave advice on backing up - he passed along someone else's advice, and said he it was essential advice. The relevant switch for your analogy would be asking him to be able to describe the code in rsync or dd or whatever. Nobody is doing that. They're simply pointing out that he didn't practice what he preached (with various degrees of unfortunate serves-him-right schadenfreude, but never mind that for the moment).

Answering a question implies a claim of authority. Offering advice on the internet (unsolicited) is the equivalent of building a freaking billboard with directions to the coffee shop.


If, on the other hand, you set up a kiosk labeled "NYC Directions" on a corner, I'd assume you knew more about getting around the city than your average bear.


Well, he is human - though he can have a new site now called "backup horrors". All kidding aside, this is horrible. How many people validate their backups on a regular basis? (Even once a year?)


Automatic validation is done through every backup (and I have my guys do a review of the logs from the nightly cron procedures). Once a month we test our restore procedures.


This is SOP in the real world.


Yup, jokes aside: if you are not doing external backups and checking them at least once a month, you are doing it wrong. You can blame your host as much as you want; it's equally your fault.

I learned it the hard way.


As I said in another thread, if you regularly refresh your development environment from your production backups, you get them verified "for free" and your ops people get plenty of practice; you never have to run a "DR exercise" to know you can recover.
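
In practice that can just be a cron job on the dev box; a sketch (names are placeholders, credentials assumed to live in ~/.my.cnf):

  # rebuild the dev database from last night's production dump;
  # if this ever breaks, you learn about a bad backup within a day instead of during a disaster
  15 6 * * * gunzip -c /srv/backups/prod-latest.sql.gz | mysql dev_db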


No it's not.

It should be. It might even be in some of the plans. But most of the time it doesn't happen.


Ditto. Also, for the stuff I don't automatically validate I run a script to check file sizes for sanity.


I don't think that was the main purpose of his blog. I'd point you to some of his posts supporting this, but... :)


"Coding Horror experienced 100% data loss at our hosting provider, CrystalTech.

I have some backups and I'll try to get it up and running ASAP!"


Read his newer tweets - his backups were just copies on the same server; they are gone as well, and now he's resorting to using internet caches to restore his content.


There are tools to retrieve data from Internet caches, like Warrick - http://warrick.cs.odu.edu/warrick.html .


[facepalm]


Programming is not system administration. As a programmer, I learned long ago that I can pay people to keep systems up, and that they do better than me and cost less than me.

Handling a failing disk is the job of a sysadmin, not a programmer.


I've learned never to trust anyone else with your critical data. I try to avoid doing sysadmin tasks (I hate them) and I pay others to do them, but I make sure the backups are working and that at least one copy is under my direct control.

Sometimes I'm extremely shocked at how cavalier professionals are about maintaining your data, and I've worked with plenty.


If you're a one-man show, you have to develop enough sysadmin skills for critical tasks such as offsite backups. I do have sympathy for Jeff; I'm certainly not laughing at him. I hope this is a strong message to the many out there not doing it right.


I want to make a snarky comment, but I just feel for the guy.


I feel for the guy too, but it really isn't the first time in the last couple of months that some service provider fails at their #1 stated goal: to keep your data safe.

And it isn't just the small ones either.

For the love of - insert your favorite deity here - please try to restore your backups, and try to do so on a regular basis. If not, all you might have is the illusion of a backup.

It is a very easy trap to fall into, and I really am happy this guy writes about it, because it seems there are still people who feel that if their data is in the hands of third parties then it is safe.

That goes for your stuff on flickr, but it also goes for your google mail account and all those other 'safe' ways to store your data online. In the end you have to ask who suffers the biggest loss if your data goes to the great big back-up drive in the sky, the service provider or you. If the answer is 'you' then go and make that extra backup.


There's an added bonus to restoring backups on a regular (daily) basis: a one day old instance of the production environment is always available as a playground for training, dev or qa.


Too many of us have been there. Coincidentally I started my own backup process about 20 minutes ago before I saw this story.


I'm with ya. This sort of thing can sneak up on you.


"I had backups, mind you, but they were on the virtual machine itself" (http://twitter.com/codinghorror/status/6577510116).

Sigh. When will we ever learn?


I sent Jeff tarballs from blekko's webcrawl for www.codinghorror.com, blog.stackoverflow.com, www.fakeplasticrock.com and haacked.com - about 6300 pages overall. He's got Coding Horror back up from the basic HTML.

Unfortunately we don't have the images, but it looks like most of the site is back up at least. It will probably be more work for him to re-integrate it into the CMS, though.


Here's Google's cached version of a post of his about backup strategies:

http://74.125.95.132/search?q=cache:2HHNAk2SB6EJ:www.codingh...

It ends in what is now irony:

"If backing up your data sounds like a hassle, that's because it is. Shut up. I know things. You will listen to me. Do it anyway."


The part you quoted is a quote itself, which could change the meaning for readers like me.


The second sentence was. He quoted it earlier and repeated it at the end as a clear endorsement of, and agreement with, the statement.


I sympathize, but I have little patience for calling out your data provider when disaster strikes. Sure, there is likely some technical fault on their end, but ultimately you are accountable for your own site.

I'm not going to imply anything negative about him for losing the site temporarily. We all have "learning experiences"...

I just hope that when he does come back, he posts an insightful analysis of how he could have done more for his own reliability, rather than point fingers at a vendor.


How is it pointing fingers if your hosting provider lost your site plus their backups of your site? They were being paid to provide a service, and they failed at it in basically the worst possible way (i.e. total data loss, rather than just downtime). Sure, he should have had his own backups, but how does the fact that he could have recovered better from his host's screw-up change the fact that they screwed up?

That's not too far from saying that it's your fault if you get hit by a car while walking across a crosswalk because you didn't jump out of the way fast enough.

They failed, they failed in a catastrophic way, and they deserve to have it made public knowledge and to lose business over it. He should do a better job backing up his own stuff, but he's right to be angry with his hosting provider and to call them out.


He is right to be angry, but I have to disagree about calling them out while the site is still down.

An analysis posted after the fact can lay out where the technical failures took place, and that is the right time to describe issues with the host.

Pointing fingers in anger is very frowned upon in all organizations I work with. It implies a lack of ownership of your own products and a lack of maturity in handling your business affairs. Organizations that pursue blame before solutions do not have a positive culture -- they have a fear-based culture.

To answer your direct question, nothing he says will change the fact that the vendor screwed up. Your analogy is correct in that it is the driver's fault if someone hits you, but flawed in that you cannot absolve yourself of all responsibility for your own safety when crossing streets.

When a vendor screws up, the level of professionalism that you portray when dealing with the situation says a great deal about yourself.


Point taken regarding calling them out before the situation is resolved: it would be far more respectful to allow the provider a reasonable amount of time to correct the error before explicitly attacking them. In this situation, I can understand it a little more, as recovering the backups from various internet caches might seem like a time-critical operation, and it would be difficult to ask for help with that without some explanation of what's happened. That explanation certainly could be fairly vague, though. Once the situation is resolved, though, if the resolution is unsatisfactory I think it's fair to say so.

I really took issue with your initial comment because "pointing the finger" has some connotation of assigning blame unfairly or unreasonably; I 100% agree that companies with a culture of blame are poisonous, and that people should err on the side of accepting too much responsibility rather than too little, and do their best to not pass blame on to other people.

There's certainly a line, however, across which I think it's reasonable to call someone else out. Where that line is depends on the situation and your relationship: the bar for doing it within your team is astronomically high (you should basically always deal with those things internally), within even the same company is still incredibly high (likewise), but it's lower when it comes to vendor relationships. Where you draw the line is probably different from where I draw it.

So while I agree that he should have waited, I don't think that publicly expressing his anger with his hosting provider after the fact would count as "pointing the finger" or "passing the buck" or otherwise indicative of a lack of personal responsibility; to me it would be understandable frustration and anger out of having been so dramatically let down by a third-party you were contracting with. And honestly, that sort of negative public publicity is one of the strongest checks we have on companies, be they hosting companies or retail stores or any other type of establishment.

More to the point: even if he did have backups, if they really lost his data and were unable to recover it themselves, I think he'd still be justified in outing their failure publicly after the fact. But again, you're definitely right that he should have waited and given the host a chance to resolve the issue before saying anything.


It's not the sort of thing I like to read, but still, it makes me cringe in sympathy.

/me does a git pull from his weblog


The next Stack Overflow podcast should be quite interesting...


This is why I like restoring my backup slice every so often, just to make sure it's there. A couple years ago I had to use archive.org as my external backup, and it wasn't fun.



Looks like someone changed something near the end of June '08 that prevented archive.org from storing the site. Gotta suck.


They have a lag (it's supposed to be six months, but they're trailing that at the moment).


Yeah, I knew about the lag, but 1 1/2 years of lag seemed to signify other issues to me.


Another data point here (I have been wondering about what is happening ever since this thread): I just noticed that archive.org's bot has crawled my web site in the last month (or at least something identifying itself as such in the logs).


When people forget to save a document and close the program, among the "ha ha I'm so superior" replies are a few people advocating unpopular niche systems which are nice enough to autosave.

When people make mistakes and save them over the top, there among the smug laughter are a few people on the user's side reminiscing about long forgotten systems of old which have versioning filesystems by default so recovery is only a moment away.

I haven't seen any replies in this thread along those lines - everyone is putting it firmly on the system administrator or the user. Would it be so hard during an install for a program to say "and now enter an encryption password and an ssh server address where I can backup to nightly"? If it's so simple you can script it yourself in a few minutes, isn't it so simple that many/most systems should come with that themselves?

It's long past the time where "computer lost my data" "well you've only yourself to blame" should be considered an old fashioned attitude.


This has to be tough. I run a small server that includes email hosting and an image gallery site that numerous people contribute to, and I get a lot of questions or complaints when it is down (usually bad network or a brief DC power outage). I have been using SSH and rsync for a long time to pull the contents of every important directory on that server to a local Solaris server running a RAIDZ pool and time slider (so I do get incremental backup).

I didn't actually lose data like Jeff, but the datacenter the server was in decided to kill the power to the machine 2 weeks before the scheduled date (poor processes and a move to a different building) and I didn't get any notification before it happened. It took another 3 weeks for them to ship my machine back to me. Because I had nightly backups I was able to restore email and the photo site to a new Linode instance in a few hours. Without those backups I would have been hurting bad.


/runs and backs up everything.


I have backups of my silly blog, antirez.com; this guy ran a business out of his blog and didn't have external backups. The data is small and easy to transfer. It's almost unbelievable to me.


Did he? I thought Jeff had some "real job" in addition to blogging and Stack Overflow. I honestly doubt an occasionally-updated programming blog was putting the food on his table.


For several years he made this post every working day:

  <p>inane comment showing <b>strong naivete</b></p>
  <blockquote>copypasta</blockquote>
  <img src=macro.jpeg>
  <blockquote>copypasta</blockquote>
  <p>Confident summary with <b>conclusive statement</b>
     displaying no comprehension of the quoted text
  </p>
  <p>Buy a Visual Studio Plugin from my sponsor!</p>


I remember him blogging at some point about making more money from the blog than from work and deciding to blog full time.

I guess I can not link to the blog post...


> I guess I can not link to the blog post...

Well, I sort of can:

http://74.125.93.132/search?q=cache:9vLV7YkK14sJ:www.codingh...


In the Joel thread, there was mention of "Truly Great Programmers(TM)", what their qualities are and how to hire them. I don't know if folks were putting Jeff in that category or not. But as a skeptic of this True Greatness, I'm perhaps grasping at straws in saying "hey look, anyone can screw up, True Greatness, phooey".


In case you'd like to back up your own blog, I posted a four part series on how to do it with Amazon S3 and cron: http://techiferous.com/2009/11/getting-started-with-amazon-s...

This assumes you're running your blog on your own VPS.


Sounds like a good coding horror article.


Redundancy, people. All my really important data lives in at least three (and often more) of the following eight places: (1) my current MacBook Pro; (2) my legacy Linux box; (3) my external USB drive; (4) my personal (Slicehost) server; (5) Dropbox; (6) Mozy; (7) Heroku; (8) GitHub. Wow, that was even more than I expected. Did I mention redundancy?


"Only wimps use tape backup: _real_ men just upload their important stuff on ftp, and let the rest of the world mirror it ;)"

    Linus Torvalds
I guess people use Google's cache and archive.org these days :)


I think it's more of an observation about which of the two are _real_ men, and less an observation about Google.


His blog gets linked to a lot, for better or worse. That's a blow.


This link appears to no longer link to a relevant post :(...


Just my 2 cents here: I encrypt my daily SQL backup before sending it to a couple of places. This way I can store it pretty much anywhere - one place is actually on a client's server. Aescrypt for encryption, ssh with expect for the actual transfer. Oh, and watch out for expect timing out - it was a fun moment when backups grew beyond the default timeout and expect started truncating them. Yup, backups should be checked often.


  Do not use SSH with Expect.
  Do not use them here or there.
  You will not use them anywhere.
Not just because of how much expect sucks: SSH tools leave out a --password option for a reason! Use passphrase-less RSA keys with a restricted account.
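
For the restricted-account part, the usual trick is a forced command in authorized_keys on the receiving end, so the key can do exactly one thing (a sketch; receive-backup.sh is a hypothetical wrapper around whatever your transfer actually is):

  # ~/.ssh/authorized_keys on the backup host
  command="/usr/local/bin/receive-backup.sh",no-pty,no-port-forwarding,no-X11-forwarding,no-agent-forwarding ssh-rsa AAAA... backup@webserver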


Also, because expect(1) defeats the entire "remote shell" aspect of SSH, which lets me do this from my workstation:

  ssh server tar zc /srv/http > http-backup.tar.gz
Yes, that's a remote command whose output is piped to a local file. With imagination, ssh, and shell-fu, one can get very far with backup automation and testing.
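
And checking that what landed on the workstation is actually restorable is one more command (a rough sketch; index.html is just a stand-in for any file you know should be in there):

  # pull one known file back out of the archive to prove it isn't garbage
  tar -xOzf http-backup.tar.gz srv/http/index.html | head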


Maybe he should've bought Microsoft Backup Server 2010


Looks like coding horror is now back online.


Maybe he should have asked a question on serverfault.com about the best way to back up his server?


I feel for him.

But I wonder if this eventually turns out to be a case study in "How to restore an entire site using the archives and search engine caches".

I know it's hard, but everything is _preserved there_ anyway. So it _is possible_.


I won't link it again in this thread, but check out Warrick. It's a tool to do just that.



The question is gone, apparently.

EDIT: Back at http://superuser.com/questions/82036/recovering-a-lost-websi...



Oh The Irony

This is really bad news for any service, but seriously... codinghorror.com (shakes head)

Rule 16. If you fail in epic proportions, it may just become a winning failure.


System administration is not "coding".


That's splitting hairs. What's a backup script? What's a restore if not a unit test?


What a nightmare. I hope he writes a good account of the story when the site is back online. I'm sure there are some great lessons to come out of all this for the rest of us - specifically, what his expectations were of the service providers he worked with.


Seems like there's room for a new startup for easy backups for shared webhosts.


how about a google search site:http://codinghorror.com/

and then looking at cached results?


AWESOME. Less trash on the Web :D


Well, it's just his blogs; for a second I thought it was stackoverflow.com and the rest of his sites, which would have been funny.


I cannot imagine how horrible this must be. Oh, the horror!



