
Ask HN: Is it necessary to backup S3? - dawie
I have a consumer SaaS web application that uses S3 to store users' documents. Do you think it's necessary to back up the S3 data to Rackspace Cloud?
======
dagw
At the end of the day it's an economics question. How much might you lose in a
worst case scenario if you failed to recover some vital data? How much extra
will it cost to back up your data to a second place? If the answers to those
questions are "everything" and "not a lot", then there is no reason not to
back up your backups.

As a general rule important data should be backed up to (at least) two
separate places, and in this scenario I'd consider S3 to be one place.

~~~
LogicHoleFlaw
AND PRACTICE YOUR RESTORES!
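
A drill can be as small as a script: restore the latest backup into a scratch
directory and verify every file against a checksum manifest written at backup
time. A minimal sketch (paths and manifest name here are hypothetical):

```shell
#!/bin/sh
# Restore-drill sketch. Assumes each backup directory contains a manifest
# written at backup time, e.g. with:
#   find . -type f ! -name MANIFEST.md5 -exec md5sum {} + > MANIFEST.md5

BACKUP_DIR=${BACKUP_DIR:-/var/backups/s3-mirror}
SCRATCH=$(mktemp -d)

# "Restore": copy the backup set somewhere it has never lived before
cp -a "$BACKUP_DIR/." "$SCRATCH/"

# Verify: every file must match the digest recorded when it was backed up
( cd "$SCRATCH" && md5sum -c --quiet MANIFEST.md5 ) \
    || echo "RESTORE DRILL FAILED for $BACKUP_DIR" >&2

rm -rf "$SCRATCH"
```

The point is that the verification runs against a fresh copy, so a silently
corrupted or incomplete backup shows up before you need it.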

~~~
jacquesm
That cannot be said often enough. We should make a list of IT disaster
stories from people who did not take that piece of advice to heart.

------
quellhorst
If you are hacked, someone could get your S3 credentials and delete your
content. You should back it up.

~~~
idlewords
It would be wonderful if S3 offered a write-only, no-overwrite mode so you
could upload stuff from servers without storing credentials.

~~~
zachbeane
You can actually kind of do that. There's a system intended to give random
users credential-free upload access to your buckets via POST, and you can
enforce policy conditions like "the key must have xyz/ as a prefix" and "the
size must be less than 100000 bytes". Here are the S3 docs for it:
[http://docs.amazonwebservices.com/AmazonS3/2006-03-01/index....](http://docs.amazonwebservices.com/AmazonS3/2006-03-01/index.html?UsingHTTPPOST.html)
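
For reference, the policy document behind that POST mechanism is a small JSON
object; a sketch with the two conditions mentioned above (the bucket name and
expiration timestamp are made up):

```json
{
  "expiration": "2009-12-01T12:00:00.000Z",
  "conditions": [
    {"bucket": "my-bucket"},
    ["starts-with", "$key", "xyz/"],
    ["content-length-range", 0, 100000]
  ]
}
```

The policy is base64-encoded, signed with your secret key, and embedded in the
upload form, so the server handing out forms never exposes the credentials
themselves.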

~~~
zachbeane
On second glance, it doesn't support "no overwrite".

------
chaosmachine
It's necessary, because you never know when you'll accidentally corrupt all
your data. Even if Amazon never fails, your application still can.

------
brk
You can put all the marketing spin on it that you want, but S3 is still just
another webapp. Architected by humans, operated by humans.

If you think the humans at Amazon are somehow above making mistakes or bad
decisions, then there is no need to backup your data.

If you don't fully trust the humans at Amazon, and want to be able to access
your data, on your terms, at any time, you should back it up...

------
jacquesm
How would you feel if one day that data was lost?

There are already documented cases of Amazon losing data in S3.

The Amazon S3 SLA does not cover data loss at all, only service-unavailability
situations, which they cover by crediting you.

<http://aws.amazon.com/s3-sla/>

[http://www.datacenterknowledge.com/archives/2007/10/09/amazo...](http://www.datacenterknowledge.com/archives/2007/10/09/amazon-offers-sla-for-s3-storage/)

And even if all this was not the case you have a responsibility to your
customers you can not outsource that responsibility.

~~~
garnaat
I have never seen a documented case of S3 losing data. Please provide
references.

I have recommended to some customers that they back up S3 data to the S3
service running in another region. The new export service also provides a way
to get physical copies of your data but, depending on how much data you have,
it might not be practical.

~~~
jacquesm
[http://developer.amazonwebservices.com/connect/thread.jspa?t...](http://developer.amazonwebservices.com/connect/thread.jspa?threadID=17246&tstart=0)

[http://developer.amazonwebservices.com/connect/thread.jspa?t...](http://developer.amazonwebservices.com/connect/thread.jspa?threadID=22709)

Do you need more, or do you think that is enough?

We've been here before by the way:

<http://news.ycombinator.com/item?id=528541>

~~~
wdewind
Come on... one of these is from '06, and it looks like it was fully dealt with
by Amazon customer service, and the other one is from '07 and is a complaint
about EC2 when it was still in beta.

I'd love to see REAL documented cases of data loss on Amazon, and anything
from the last year would be great too.

Not saying it's infallible, and you should always back up your data (Amazon
can't prevent things like natural disasters - this is the point of backing up
in general - you can't predict), but if you're going to act like you gave good
examples, at least give good examples.

~~~
jacquesm
Look, all I'm saying is it happened before, Amazon makes absolutely no
guarantees and it could very well happen again.

To categorically deny this because you think the cases are not 'good enough'
amounts to arguing that only specific documentation of S3 losing data in the
last couple of months or a year would convince you that Amazon S3 can indeed
lose data. Even if Amazon S3 had never lost data before, there would still be
no reason to assume that it could not happen.

S3 is made up of hardware and built by people. It can, and most likely will,
fail again; it has already done so in the past. When the last failure happened
is not really relevant, just as when the last earthquake happened is not
really relevant when you're living on a fault line.

Earthquakes, and data loss, are a fact of life in the IT business. You plan
for them, or you weigh the economics of the risk and decide that you can
re-create your data at a lower cost than it would cost to back it up over the
mean time to failure.

Amazon will not be able to magically recreate your data so if you have a
business incentive to keep your data (such as a responsibility to third
parties) then you should back it up.

It's that simple.

Oh, and regarding Amazon customer service, note that it took them 11 days to
pinpoint the fault, and customer data actually was lost.

Check Allan's post at Jun 23, 2008 6:28 AM for a pretty good insight into how
easy it is for S3 to break.

What also bothers me is that apparently all traffic for these customers was
passing through the same SPOF, a single load balancer.

Another thing to take home from this is to ALWAYS supply an MD5 of your data
and keep an MD5 of what you sent.
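
For what it's worth, the Content-MD5 header that S3 checks on upload wants the
base64 of the raw 16-byte digest, not the hex string; a small sketch of both
halves of that advice (the file name is made up):

```shell
# Supply an MD5 with the upload and keep a record of what you sent
# (file name is hypothetical).
FILE=document.pdf

# Hex digest to keep on your side; for single-part uploads S3's ETag is
# this same hex MD5, so you can compare it after the PUT.
md5sum "$FILE" | awk '{print $1}' > "$FILE.md5"

# The Content-MD5 request header wants base64 of the raw binary digest,
# so S3 can reject an upload that was corrupted in transit.
openssl dgst -md5 -binary "$FILE" | openssl base64
```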

Gmail, another example of a large body of data that end users have some
attachment to, has also occasionally lost data; see "Have you lost email on
gmail":

<http://www.thebitguru.com/blog/view/252-Have>

Sure, you could argue that Gmail is not S3, but that is not relevant; the
things they have in common (type of architecture, kind of hardware, run by
very fallible people) are what matter.

~~~
wdewind
As I said:

Not saying it's infallible, and you should always back up your data (Amazon
can't prevent things like natural disasters - this is the point of backing up
in general - you can't predict), but if you're going to act like you gave good
examples, at least give good examples.

If that's what you call categorically denying that it can happen....

Again, please find a case in the last 2 years even.

I think we both agree you should back up your data, and as an IT policy it's
obviously incorrect to ever think you're 100% safe; if you use S3 you should
still be redundant if you want to get closer to that 99.9% limit. But you'll
never hit 100%. That's life.

The only reason I defend S3 so heavily is that compared to the other options
you'd be using instead of it (or better: in conjunction with it), it's
probably among the safest, data-loss-wise.

------
kierank
Yes. Sometimes the software tools you use can mess up. I lost many production
EC2 images because of this unfixed bug in ElasticFox:

[http://sourceforge.net/tracker/index.php?func=detail&aid...](http://sourceforge.net/tracker/index.php?func=detail&aid=2835716&group_id=212540&atid=1022151)

(basically, if you have similarly named images, they'll all get deleted by the
"Delete AMI parts and deregister" feature)

------
aolnerd
Keep multiple snapshots so that you can recover files after you delete them
yourself:

    #!/bin/sh
    AWS_ACCESS_KEY_ID=<>
    AWS_SECRET_ACCESS_KEY=<>
    BUCKET=<>
    DIR=<>
    S3SYNC=/usr/bin/s3sync.rb
    DAYS_TO_KEEP_BACKUPS=30

    # colons in path confuse s3sync
    DATE=$(date +%F_%H-%M-%S)
    NEWEST=$(/bin/ls -r $DIR | grep -v current | head -1)

    # copy the newest copy of files into a new directory,
    # creating hardlinks instead of duplicate files
    cp -al $DIR/$NEWEST/ $DIR/$DATE/

    # sync the s3 bucket against the new directory
    $S3SYNC --recursive --make-dirs --delete --no-md5 -v $BUCKET: $DIR/$DATE/ 2>&1 | grep -v 'Could not change owner'

    # update the "current" symlink
    test -e $DIR/current && rm $DIR/current
    ln -sf $DIR/$DATE $DIR/current

    # remove snapshots older than DAYS_TO_KEEP_BACKUPS days
    find $DIR -maxdepth 1 -type d -mtime +$DAYS_TO_KEEP_BACKUPS -print0 | xargs -0 --no-run-if-empty rm -rv
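
A script like this is typically run nightly from cron; a possible crontab
entry (the path and schedule are made up):

```
# run the snapshot script at 02:30 every night
30 2 * * * /usr/local/bin/s3-snapshot.sh >> /var/log/s3-snapshot.log 2>&1
```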

------
PanMan
So far I have never heard of S3 actually losing data (apart from users erasing
it themselves). They store everything 3 times. But I would have a look at the
cost per MB. Documents are usually small and important (and thus worth backing
up). I have read that SmugMug, for example, stores its files just on S3, with
no backup.

~~~
mechanical_fish
_apart from users erasing it_

Very _very_ important caveat, there. If you have three copies of the data, but
all of them are in the same S3 account, a prankster who steals your S3 creds
can delete all of them in about ten seconds.

Or, if you make a typing mistake, you can do that to yourself. Boy oh boy,
will that be an unhappy day.

Diversify, diversify, diversify.

~~~
sbhat7
Very true. One instance I can think of is the Firefox addon S3Fox for managing
S3. My co-worker came close to deleting a prod bucket with 80K+ images with
it.

------
mgorsuch
You should certainly back up your data. While the engineers at Amazon are
among the smartest, they are not infallible. Things do fail from time to time.

I lost 1TB of data several months ago due to some backend issues with EBS and
S3. Fortunately for me, it was just a backup of a backup of a backup. ;-)

~~~
mgorsuch
It's also worth pointing out that this might have been more of a failure of
EBS (Elastic Block Store) than of native S3 buckets.

Ultimately, my EBS device became unusable by any operating system, and Amazon
support stated that the data was lost due to several backend systems failing.

------
byoung2
At $0.15/GB it's a no-brainer for the peace of mind. I once heard of a case
(not with Amazon) where data was replicated in 3 locations; the data got
corrupted in one location and the corruption was copied to the others,
resulting in corrupted data in all 3.
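
For scale, the storage-only arithmetic at that rate (the data-set size is a
made-up example; transfer and request fees are extra):

```shell
GB=100   # hypothetical data set size
# full second copy at $0.15 per GB-month, storage only
awk -v gb="$GB" 'BEGIN { printf "$%.2f/month\n", gb * 0.15 }'
```

For a document-storage app, where per-user data is small, that's usually
cheap insurance.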

------
windsurfer
Standard response from anyone: back up everything, _just in case_

<http://tvtropes.org/pmwiki/pmwiki.php/Main/CrazyPrepared>

------
lolido
FYI, you should take a look at Cloudloop (www.cloudloop.com). It has a nice
command-line interface that lets you sync data across providers (S3,
Rackspace, Nirvanix, Azure, etc.).

------
oscardelben
You can refer to this discussion for a tool for s3 backup

<http://news.ycombinator.com/item?id=577717>

------
dnsworks
Remember that Amazon is not infallible. In fact, Amazon has had at least two
S3 outages that resulted in permanent data loss.

Would you put all of your eggs in the basket of a bookstore that only recently
decided to become an egg-storage vendor?

