
S3 glitch causes Tarsnap outage - cperciva
http://www.daemonology.net/blog/2010-09-17-S3-glitch-causes-Tarsnap-outage.html
======
Groxx
Nicely written... I don't think I've read a more reassuring outage message
before. Especially since it doesn't particularly sound like it's meant to be
placating, it's just informing.

~~~
cperciva
Thanks, but I wasn't trying to be reassuring, only to provide an explanation
of what happened.

~~~
uuilly
A good explanation is reassuring.

------
jacquesm
Interesting how a 404 basically now translates into "purge DNS cache and try
again".
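
The behaviour described above can be sketched as a retry loop. This is a hypothetical illustration, not Tarsnap's actual code; `fetch` is a stand-in transport function, and the process-local dict stands in for a real resolver cache:

```python
import socket

_dns_cache = {}  # hostname -> resolved address

def resolve(host):
    """Resolve a hostname through a simple process-local cache."""
    if host not in _dns_cache:
        _dns_cache[host] = socket.gethostbyname(host)
    return _dns_cache[host]

def get_object(fetch, host, key, attempts=3):
    """Fetch `key`, treating an unexpected 404 as 'purge DNS cache and retry'.

    `fetch(addr, key)` is a hypothetical transport returning (status, body).
    A fresh DNS resolution may route the retry to a node that can actually
    find the object.
    """
    for attempt in range(attempts):
        addr = resolve(host)
        status, body = fetch(addr, key)
        if status == 200:
            return body
        if status == 404 and attempt < attempts - 1:
            _dns_cache.pop(host, None)  # drop the cached answer; re-resolve
            continue
        raise IOError(f"S3 returned {status} for {key}")
```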

Did I understand it right that the inability to locate a single object for a
single customer affected all your customers?

~~~
cperciva
_Did I understand it right that the inability to locate a single object for a
single customer affected all your customers?_

Almost right. The Tarsnap server aggregates together data from multiple
machines into a single S3 object; I don't know how many users had data stored
in that object, but it's probably more than 1 and less than 10.

But the problem wasn't really caused by the object going missing; rather, it
was caused by S3 doing Something Which S3 Should Never Do, plus the Tarsnap
server code not being designed with that possibility in mind. I've since
adjusted the code so that an error like this will be handled less severely.

(That said, I doubt I'll ever see this S3 glitch again -- I got a phone call
from Amazon providing some additional details about what caused this and it
was clear that they were taking it very seriously.)

~~~
jacquesm
> That said, I doubt I'll ever see this S3 glitch again -- I got a phone call
> from Amazon providing some additional details about what caused this and it
> was clear that they were taking it very seriously.

They'd better.

I'm as surprised as you are (well, probably not, since I'm not using S3), but
I know a bit about how it is put together, and I can't believe the Amazon
people are happy about having this happen to them; it's the exact opposite of
what should happen in a 'cloud' storage situation.

I think there will be some pretty high-level meetings on this glitch; the one
thing you don't want is customer data going absent-without-leave, even if only
on a holiday instead of a permanent departure.

Isn't it against your instincts to have data from different customers live in
what amounts to the same file? I understand you've got it encrypted to the
hilt, but that seems 'un-Colinish' ;)

~~~
die_sekte
As far as I understand, he is using S3 as a dumb block-level store. He has
implemented a file system on top of S3.

All customer data is stored in different files, it's just that those files
don't map 1:1 to S3 objects.
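
The mapping described above could be sketched roughly like this. All names here (`PackedStore`, the `obj-N` keys, the tiny block size) are made up for illustration; the point is only that per-customer files are chunked into blocks, and blocks from several customers can be packed into the same underlying object:

```python
BLOCK_SIZE = 4  # tiny for illustration; real systems use KB-sized blocks

class PackedStore:
    """Sketch: customer files don't map 1:1 to stored objects."""

    def __init__(self, blocks_per_object=2):
        self.s3 = {}       # object key -> list of blocks (stand-in for S3)
        self.index = {}    # (customer, filename) -> [(obj_key, slot), ...]
        self.next_obj = 0
        self.open_obj = None
        self.blocks_per_object = blocks_per_object

    def _put_block(self, block):
        # Open a fresh object when there is none, or the current one is full.
        if (self.open_obj is None
                or len(self.s3[self.open_obj]) == self.blocks_per_object):
            self.open_obj = f"obj-{self.next_obj}"
            self.next_obj += 1
            self.s3[self.open_obj] = []
        self.s3[self.open_obj].append(block)
        return (self.open_obj, len(self.s3[self.open_obj]) - 1)

    def write(self, customer, filename, data):
        blocks = [data[i:i + BLOCK_SIZE]
                  for i in range(0, len(data), BLOCK_SIZE)]
        self.index[(customer, filename)] = [self._put_block(b) for b in blocks]

    def read(self, customer, filename):
        return b"".join(self.s3[obj][slot]
                        for obj, slot in self.index[(customer, filename)])
```

With this layout, losing one stored object can affect several customers' files at once, which matches what cperciva describes above.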

------
almost
Ouch, I could imagine that S3 glitch causing a serious problem. Not in Colin's
code, I'm sure, but maybe with someone a little less diligent. A 404 could
cause a system to assume something isn't there and maybe, shock horror, write
something else in its place. I can imagine that being very very bad... Still,
sounds like it won't be happening again.

------
hopeless
The kill-switch-on-error idea is very interesting and I can see why it might
be necessary for something like Tarsnap.

One question, though: could this feature be used in a denial-of-service
attack? I.e., induce errors in the Tarsnap server or its supporting
environment (such as DNS) so that it shuts down for everyone? Admittedly,
there doesn't seem to be much point in this, but I'm curious whether it's an
angle you've considered.

~~~
cperciva
_could this feature be used in a denial-of-service attack? i.e., induce errors
in the Tarsnap server or its supporting environment (such as DNS) so that it
shuts down for everyone?_

If you can impersonate an S3 server, yes.

But if you're impersonating an S3 server, I _want_ Tarsnap to shut down
pending investigation.
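
A kill-switch policy along these lines could be sketched as follows. This is a hypothetical illustration of the pattern, not Tarsnap's actual logic; the exception names and status handling are assumptions:

```python
class RetryLater(Exception):
    """Transient failure: back off and retry the request."""

class KillSwitch(Exception):
    """'Impossible' failure: shut the service down pending investigation."""

def check_response(status, key, stored_keys):
    """Classify an S3-style response for an object we may have stored."""
    if status == 200:
        return "ok"
    if status == 404 and key not in stored_keys:
        return "miss"      # a miss on a key we never wrote is fine
    if status == 404:
        # S3 should never 404 on an object we know we stored: either S3
        # misbehaved, or something is impersonating it. Halt everything
        # rather than risk acting on bad data.
        raise KillSwitch(key)
    raise RetryLater(key)  # 5xx and friends: transient, retry later
```

The design choice is that a normally harmless status code becomes fatal precisely when it contradicts the server's own records, which is why impersonating S3 is the only obvious way to trigger it maliciously.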

------
zackattack
Incidentally, I have no idea why everyone treats S3 as so reliable. All these
services offer backups to S3. Well folks, I think that S3 is bound to fail
some day, and some good peace of mind could be manufactured by mirroring S3
files to a few different mirrors.

~~~
lsc
They sound like they have a pretty good system, and if you can afford five
petabytes and don't do much moving in and out, they are cheap, too.

On the other hand, their system is non-standard. It's not used by or tested by
anyone else. And it sounds like a fairly complex system. The fact that they
haven't lost a whole lot of data yet means that they must be pretty good...
but the more complex (and unusual) a system, the more I would fear a failure
caused by a software bug.

~~~
cperciva
_[S3] sounds like a fairly complex system_

That depends on your definition of "complex", of course. In terms of the
number of lines of code, I'd guess that S3 is significantly simpler than a
typical filesystem -- key-blob CRUD is a _much_ simpler thing to implement
than directory trees, file metadata, memory-mapped files, and all the other
horrible messes filesystems need to handle.
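
To make the contrast concrete, the entire key-blob interface being described fits in a few lines; a minimal sketch (a toy in-memory stand-in, obviously nothing like S3's distributed implementation):

```python
class KeyBlobStore:
    """Toy sketch of a key-blob CRUD interface: four operations over an
    opaque key -> bytes mapping, with no directories, file metadata, or
    memory-mapped files to get right."""

    def __init__(self):
        self._data = {}

    def create(self, key, blob):
        self._data[key] = blob

    def read(self, key):
        return self._data[key]

    def update(self, key, blob):
        self._data[key] = blob

    def delete(self, key):
        del self._data[key]
```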

As far as data loss specifically goes: S3 is in a good place. Most of the
complexity of a system like S3 is in _finding_ data -- that is, routing
requests to the right node -- not in merely _not losing_ data. I don't think
it's a coincidence at all that despite several outages over the past few years
which have made S3 _unavailable_, they've always been able to bring the
service back online with no loss of data.

~~~
lsc
You'd know much more about these sorts of things than I would, but most
filesystems largely ignore disk corruption. If Amazon did that, we'd be seeing
a lot more trouble than we have. This is where I'm seeing the complexity:
detecting and recovering from subtle errors. I've gone a short way down that
path, and from where I stand it looks quite hairy.

Besides, I've seen data loss caused by ext3 on my own systems that was not
caused by disk corruption, so really, at S3's scale, it's got to function (and
seems to have been functioning) much better than a regular filesystem.

~~~
cperciva
S3 has an advantage over regular filesystems there too. S3 doesn't need to
provide single-node durability as long as data loss events on different nodes
are uncorrelated; so they can protect data with cryptographic checksums and
throw it away (from an individual node) at the first sign of corruption,
knowing that they'll still have it on all the other nodes.
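
The scheme described above can be sketched as follows. This is a hypothetical illustration of the idea (checksummed per-node storage, discard on mismatch, fall back to replicas), not a description of S3's internals:

```python
import hashlib

class Node:
    """One storage node: keeps each blob with a SHA-256 checksum and drops
    its copy at the first sign of corruption, trusting the replicas."""

    def __init__(self):
        self.blobs = {}  # key -> (sha256 digest, data)

    def put(self, key, data):
        self.blobs[key] = (hashlib.sha256(data).digest(), data)

    def get(self, key):
        digest, data = self.blobs[key]
        if hashlib.sha256(data).digest() != digest:
            del self.blobs[key]  # corrupted: discard our copy entirely
            return None
        return data

def replicated_get(nodes, key):
    """Read from replicas in turn; any one clean copy is enough."""
    for node in nodes:
        if key in node.blobs:
            data = node.get(key)
            if data is not None:
                return data
    raise KeyError(key)
```

The point is that no single node needs to be durable: as long as corruption on different nodes is uncorrelated, aggressively discarding suspect copies is safe.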

------
michaelhalligan
Such is the danger of sharecropping.

