Hacker News
Ask HN: Is it necessary to backup S3?
26 points by dawie on Sept 14, 2009 | 29 comments
I have a consumer SaaS web application that uses S3 to store users' documents. Do you think it's necessary to back up the S3 data to Rackspace Cloud?

At the end of the day it's an economics question. How much might you lose in a worst-case scenario if you failed to recover some vital data? How much extra will it cost to back up your data to a second place? If the answers to those questions are "everything" and "not a lot", then there is no reason not to back up your backups.

As a general rule important data should be backed up to (at least) two separate places, and in this scenario I'd consider S3 to be one place.


That cannot be said often enough. We should make a list of IT disaster stories from people who did not take that piece of advice to heart.

If you are hacked, someone could get your S3 credentials and delete your content. You should back up.

It would be wonderful if S3 offered a write-only, no-overwrite mode so you could upload stuff from servers without storing credentials.

You can actually kind of do that. There's a system intended to allow random users credential-free upload access to your buckets via POST, but you can enforce policy control like "the target has to have xyz/ as a prefix" and "the size must be fewer than 100000 bytes". Here are the S3 docs for it: http://docs.amazonwebservices.com/AmazonS3/2006-03-01/index....
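To make that concrete, here's a sketch in Python of building and signing such a POST policy under the HMAC-SHA1 signing scheme S3 used for browser uploads (the bucket name, expiry date, and secret key below are made-up placeholders):

```python
import base64, hashlib, hmac, json

# Placeholder credential; a real app would load this from secure config.
AWS_SECRET = "example-secret-key"

policy = {
    "expiration": "2009-12-01T12:00:00Z",        # made-up expiry
    "conditions": [
        {"bucket": "my-app-uploads"},            # hypothetical bucket name
        ["starts-with", "$key", "xyz/"],         # restrict the target prefix
        ["content-length-range", 0, 100000],     # cap uploads at 100000 bytes
    ],
}

# The JSON policy is base64-encoded and signed with HMAC-SHA1; both values
# go into hidden form fields alongside the user's file in the POST form,
# so the browser never sees your secret key.
policy_b64 = base64.b64encode(json.dumps(policy).encode()).decode()
signature = base64.b64encode(
    hmac.new(AWS_SECRET.encode(), policy_b64.encode(), hashlib.sha1).digest()
).decode()

print(policy_b64)
print(signature)
```

S3 re-derives the signature from the posted policy and rejects any upload whose key or size violates the conditions.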

On second glance, it doesn't support "no overwrite".

Ah, thanks, this is very useful.

It's necessary, because you never know when you'll accidentally corrupt all your data. Even if Amazon never fails, your application still can.

Yes, sometimes the software tools you use can mess up. I lost many production EC2 images because of this unfixed bug in Elasticfox:


(basically if you have similarly named images, they'll all get deleted with the "Delete AMI parts and deregister" feature)

You can put all the marketing spin on it that you want, but S3 is still just another webapp. Architected by humans, operated by humans.

If you think the humans at Amazon are somehow above making mistakes or bad decisions, then there is no need to backup your data.

If you don't fully trust the humans at Amazon, and want to be able to have access to your data, on your terms, at any time, you should back it up...

Keep multiple snapshots so that you can recover files after you delete them yourself.








Here's a snapshot-style script along these lines (the variable definitions at the top are placeholders; adjust them for your setup):

    #!/bin/sh
    # placeholder configuration -- set these for your environment
    S3SYNC=/usr/local/bin/s3sync.rb
    BUCKET=your-bucket
    DIR=/backups/s3
    DAYS_TO_KEEP_BACKUPS=30

    # colons in path confuse s3sync
    DATE=$(date +%F_%H-%M-%S)
    NEWEST=$(/bin/ls -r $DIR | grep -v current | head -1)

    # copy the newest snapshot into a new directory, creating hardlinks instead of duplicate files
    cp -al $DIR/$NEWEST/ $DIR/$DATE/

    # sync the s3 bucket against the new directory
    $S3SYNC --recursive --make-dirs --delete --no-md5 -v $BUCKET: $DIR/$DATE/ 2>&1 | grep -v 'Could not change owner'

    # update the "current" symlink
    test -e $DIR/current && rm $DIR/current
    ln -sf $DIR/$DATE $DIR/current

    # remove snapshots older than DAYS_TO_KEEP_BACKUPS days
    find $DIR -maxdepth 1 -type d -mtime +$DAYS_TO_KEEP_BACKUPS -print0 | xargs -0 --no-run-if-empty rm -rv

How would you feel if one day that data was lost?

There are already documented cases of Amazon losing data in S3.

The Amazon S3 SLA does not cover data loss at all, only service-unavailability situations, which they cover by crediting you.



And even if all this were not the case, you have a responsibility to your customers; you cannot outsource that responsibility.

I have never seen a documented case of S3 losing data. Please provide references.

I have recommended to some customers that they back up S3 data to the S3 service running in another region. The new export service also provides a way to get physical copies of your data but, depending on how much data you have, it might not be practical.

Come on... one of these is from '06, and it looks like it was fully dealt with by Amazon customer service, and the other one was from '07 and is a complaint about EC2 when it was still in beta.

Would love to see REAL documented cases of data loss on Amazon, and anything in the last year would be great too.

Not saying it's infallible, and you should always back up your data (Amazon can't prevent things like natural disasters - this is the point of backing up in general - you can't predict), but if you're going to act like you gave good examples, at least give good examples.

Look, all I'm saying is it happened before, Amazon makes absolutely no guarantees and it could very well happen again.

To categorically deny this because you think the cases are not 'good enough' is to argue that only specific documentation of S3 losing data within the last few months or a year would convince you that Amazon S3 can indeed lose data. Even if Amazon S3 had never lost data before, there still would be no reason to assume that it could not happen.

S3 is made of hardware and built by people. It can fail again (and most likely will); it has already done so in the past. When the last case was is not really relevant, just as when the last earthquake was is not relevant when you're living on a fault line.

Earthquakes, and data loss, are a fact of life in the IT business: you plan for them, or you weigh the economics of the risk and decide that you can re-create your data at a lower cost than backing it up over the average time to failure.

Amazon will not be able to magically recreate your data so if you have a business incentive to keep your data (such as a responsibility to third parties) then you should back it up.

It's that simple.

Oh, and regarding Amazon customer service, note that it took them 11 days to pinpoint the fault, and customer data actually was lost.

Check Allan's post at Jun 23, 2008 6:28 AM for a pretty good insight into how easily S3 can break.

What also bothers me is that apparently all traffic for these customers was passing the same SPOF, a single load balancer.

Another thing to take home from this is to ALWAYS supply an MD5 of your data and keep a record of the MD5 of what you sent.
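For anyone unfamiliar with the mechanics: S3 accepts a Content-MD5 request header and rejects uploads that don't match it. A minimal Python sketch of computing that header value (base64 of the raw digest) while keeping the hex digest for your own records:

```python
import base64, hashlib

def content_md5(data: bytes):
    """Return (Content-MD5 header value, hex digest to keep locally)."""
    digest = hashlib.md5(data).digest()
    # The header wants base64 of the *binary* digest, not the hex string.
    return base64.b64encode(digest).decode(), digest.hex()

header_value, local_record = content_md5(b"hello world")
print(header_value, local_record)
```

Storing the hex digest alongside your own metadata lets you re-verify downloads later, independently of what S3 reports.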

Gmail, another example of a large body of data that end users have some attachment to, has also occasionally lost data; see:

http://www.thebitguru.com/blog/view/252-Have you lost email on gmail

Sure, you could argue that Gmail is not S3, but that is not relevant; the things they have in common (type of architecture, kind of hardware, run by very fallible people) are what matter.

As I said:

Not saying it's infallible, and you should always back up your data (Amazon can't prevent things like natural disasters - this is the point of backing up in general - you can't predict), but if you're going to act like you gave good examples, at least give good examples.

If that's what you call categorically denying that it can happen....

Again, please find a case in the last 2 years even.

I think we both agree you should back up your data, and as an IT policy it's obviously incorrect to ever think you're 100% safe, and if you use S3 you should still be redundant if you want to get closer to that 99.9% limit. But you'll never be 100% - that's life.

The only reason I defend S3 so heavily is that compared to the other options you'd be using instead of it (or better: alongside it), it's probably among the safest, data-loss wise.

So far I have never heard of S3 actually losing data (apart from users erasing it). They store everything three times. But I would have a look at the cost per MB. Documents are usually small and important (and thus worth backing up). I have read that SmugMug, for example, stores its files just on S3, with no backup.

apart from users erasing it

Very very important caveat, there. If you have three copies of the data, but all of them are in the same S3 account, a prankster who steals your S3 creds can delete all of them in about ten seconds.

Or, if you make a typing mistake, you can do that to yourself. Boy oh boy, will that be an unhappy day.

Diversify, diversify, diversify.

Very true. One instance I can think of is the Firefox add-on S3Fox for managing S3. My co-worker came close to deleting a prod bucket with 80K+ images.

You should certainly back up your data. While the engineers at Amazon are among the smartest, they are not infallible. Things do fail from time to time.

I lost 1TB of data several months ago due to some backend issues with EBS and S3. Fortunately for me, it was just a backup of a backup of a backup. ;-)

It's also worth pointing out that this might have been more a failure of EBS (Elastic Block Store, whose snapshots are stored in S3) than of native S3 buckets.

Ultimately, my EBS device became unusable by any operating system, and Amazon support stated that the data was lost due to several backend systems failing.

At $0.15/GB it's a no-brainer for the peace of mind. I once heard of a case (not with Amazon) where data was replicated in three locations, data got corrupted in one location and was copied to the others, resulting in corrupted data in all three.
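A cheap guard against exactly that failure mode is to check each object's digest against a recorded manifest before mirroring it. A hypothetical sketch (the helper name and manifest shape are made up for illustration):

```python
import hashlib

def safe_to_mirror(data: bytes, manifest: dict, key: str) -> bool:
    """Record a digest on first sight; flag later digest changes."""
    digest = hashlib.md5(data).hexdigest()
    if key not in manifest:
        manifest[key] = digest      # first sighting: record and allow
        return True
    # A changed digest may mean silent corruption; don't propagate it.
    return manifest[key] == digest

manifest = {}
print(safe_to_mirror(b"report-v1", manifest, "docs/report"))   # True (new object)
print(safe_to_mirror(b"c0rrupt3d", manifest, "docs/report"))   # False (digest changed)
```

A real system would also have to distinguish intended updates from corruption, e.g. by versioning the manifest, but the point stands: verify before you replicate.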

Standard response from anyone: Backup everything, just in case


FYI, you should take a look at Cloudloop (www.cloudloop.com). It has a nice command-line interface that lets you sync data across providers (S3, Rackspace, Nirvanix, Azure, etc.).

You can refer to this discussion for an S3 backup tool:


Remember that Amazon is not infallible. In fact, Amazon has had at least two S3 outages that resulted in permanent data loss.

Would you put all of your eggs in a bookstore who just recently decided to become an egg storage vendor?
