

Why S3 went down - slackerIII
http://status.aws.amazon.com/s3-20080720.html

======
sgrove
I think Amazon is approaching this in the right fashion - when you have a
major outage, show that you understand exactly what the issue is, and why it
won't happen in the future. That's quite a while to be down, but their
openness, and what I perceive as honesty, helps me trust that they know what
they're doing. And like mixmax said, their detailed description and timeline
scores them extra points, instead of the usual corporate nothingness. We're
still planning on moving over to "cloud-computing", or "infrastructure-
outsourcing", and AWS still seems to be the leader in the field.

------
mixmax
They get a lot of points in my book by providing a reasonably detailed
description of the event instead of delivering the usual non-informative yada-
yada.

------
mechanical_fish
_On Sunday, we saw a large number of servers that were spending almost all of
their time gossiping and a disproportionate amount of servers that had failed
while gossiping._

This is my favorite sentence of the day.

Now every time I see that damned whale on Twitter I'm going to find myself
involuntarily crying out, "Oh, no! Failed while Gossiping!"

~~~
alabut
The converse is true too, when Twitter is up and running just fine. I'm going
to think of that every time I tweet - I _succeeded_ at gossiping and failed at
working.

------
ROFISH
A single bit was flipped? Cosmic radiation comes back and bites us all in the
rear! (Of course there are more reasonable explanations, but cosmic radiation
is always my favorite. :D)

------
holygoat
This is exactly the kind of RFO I would like to get from our telephony
providers after an outage. Detailed, honest (no point lying -- you just had an
outage!), and taking steps to improve matters in the future.

Everyone has outages: it's what you learn from them that counts.

------
DougBTX
That's a great bug, very nice "illness"-like behaviour. The spread of a
corrupt bit, infecting and taking down the host.

~~~
jacobbijani
Yeah, it almost read like a description of a virus.

~~~
briansmith
That is why gossip protocols are also called "epidemic."
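
A toy sketch of the epidemic behaviour, assuming a simple push model where every node that holds a piece of state forwards it to a couple of random peers each round. This is not Amazon's protocol; `gossip_rounds`, the fanout, and the node count are invented for illustration.

```python
import random

def gossip_rounds(num_nodes, fanout=2, seed=0):
    """Count the rounds until one node's state has reached every node."""
    rng = random.Random(seed)
    infected = {0}  # node 0 starts out holding the (possibly corrupt) state
    rounds = 0
    while len(infected) < num_nodes:
        newly_reached = set()
        for _ in infected:
            # Each node holding the state pushes it to `fanout` random peers.
            newly_reached.update(rng.sample(range(num_nodes), fanout))
        infected |= newly_reached
        rounds += 1
    return rounds

if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, "nodes:", gossip_rounds(n), "rounds")
```

The number of nodes holding the state can at most triple per round with a fanout of 2, so coverage grows roughly exponentially. That is great for spreading good state quickly and just as effective at spreading a bad bit.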

~~~
jacobbijani
Thanks, I meant to look up gossip protocols after reading the article but
forgot. Your comment reminded me.

~~~
slackerIII
Does anyone know if there is a "best" paper that describes gossip protocols?
They were mentioned in the Dynamo paper, but they didn't really get into them.

~~~
pageman
try:

[http://www.cs.cornell.edu/courses/cs614/2004sp/papers/BHO99....](http://www.cs.cornell.edu/courses/cs614/2004sp/papers/BHO99.pdf)

and

[http://www.cs.cornell.edu/projects/quicksilver/public_pdfs/B...](http://www.cs.cornell.edu/projects/quicksilver/public_pdfs/BringingAutonomic.pdf)

both by Ken Birman

~~~
slackerIII
Great! Thank you very much.

------
newton
Can I point out that this explanation did not identify a root cause? How was
the corrupt message originally produced?

If they know how it happened, it's not reflected in this article. The solution
they describe addresses the detection of and recovery from future mysterious
occurrences rather than identifying, understanding and eliminating whatever
bug or condition caused this one.

~~~
attack
Memory corruption.

~~~
jordyhoyt
Right, it likely wasn't a bug if it is a single bit and this only happens once
in a blue moon. A machine can only transmit perfect 1's and 0's for so long
before getting one wrong.

~~~
jrsims
But they're adding checksums right? Anyone see any holes in that?

~~~
jordyhoyt
Do you mean this?

> we're adding checksums to proactively detect corruption of system state
> messages

I doubt that means they're actually _adding_ them together, they're adding
checksums to the process to ensure data corruption has no effect. Or am I
misunderstanding you?

~~~
michaelneale
They use MD5 in other areas, but not that particular message. So now they
will. Do people call hashes "checksums" in a colloquial sense? No one actually
uses a "check sum" literally any more, do they?
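
For comparison, here is a rough sketch of a literal byte sum next to CRC32 and MD5 from the Python standard library. The message contents and the helper names `additive_checksum` and `flip_bit` are made up for illustration; nothing here reflects what S3 does internally.

```python
import hashlib
import zlib

def additive_checksum(data: bytes) -> int:
    """A literal 'check sum': add up the bytes and keep the low 8 bits."""
    return sum(data) % 256

def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of data with a single bit inverted."""
    out = bytearray(data)
    out[byte_index] ^= 1 << bit
    return bytes(out)

original = b"node-17: healthy"        # hypothetical state message
corrupted = flip_bit(original, 3, 0)  # exactly one flipped bit

# All three schemes notice a single flipped bit.
print(additive_checksum(original) != additive_checksum(corrupted))
print(zlib.crc32(original) != zlib.crc32(corrupted))
print(hashlib.md5(original).digest() != hashlib.md5(corrupted).digest())

# A literal sum cannot tell reordered bytes apart, while CRC32 can.
print(additive_checksum(b"ab") == additive_checksum(b"ba"))
print(zlib.crc32(b"ab") != zlib.crc32(b"ba"))
```

Every comparison prints `True`. So even a plain sum would have caught a single flipped bit; the stronger functions matter for messier kinds of corruption.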

------
brfox
What was Amazon's response when their storefront went down a couple months
ago? Were they this upfront about it?

~~~
sh1mmer
I think this is possible with developer relations.

Trying this approach with consumers could backfire horribly if someone
misunderstands the details or tries to twist them out of context.

That's a lot harder to do with a developer audience.

------
charlesju
I think that while it is unfortunate that Amazon went down, they handled the
situation properly. I'm sure that within a couple of years they'll hammer out
all the kinks in the system.

------
sanj
"By 11:05am PDT, all server-to-server communication was stopped...

By 2:20pm PDT, we'd restored internal communication..."

I wonder why it took 3 hours to clear state.

~~~
cperciva
_I wonder why it took 3 hours to clear state._

It didn't -- it took 3 hours to get all the machines talking again after the
state was cleared.

This is probably mostly due to two factors: 1. When doing a "clean restart"
Amazon almost certainly had each S3 node look at the data it had stored to
make sure that they had correct metadata; and 2. After each node had been
restarted, Amazon probably had to relink them gradually rather than all at
once -- most self-organizing network protocols have limits on the rate at
which nodes can join or leave the network.

~~~
curiousgeorge
Data integrity testing?

~~~
cperciva
_Data integrity testing?_

This question no verb? Or rather... I have no clue what you're asking, can you
clarify?

~~~
DougBTX
The question mark isn't being used to end a question, he's using it to ask for
confirmation of a statement. He's writing in a way that mimics how people talk
to each other when they meet face-to-face.

For example:

    
    
        Jack: I wonder what Jill's favourite flavour of ice cream is.
        Sam:  Chocolate?
        Jack: Yes, that's the one.

------
beaudeal
I agree with mixmax in that I really appreciate them being open and
forthcoming about the problems, and giving detailed explanations. On the
flip side, I'm wondering what the outage will do to potential customers and
those who are currently using the service. It's a bit scary to think that they
are hosting some very serious businesses and everyone went down for many hours.

------
marijn
Downtime? It was only an 'Availability Event', thank you very much. But yes, I
like that they provide details.

------
gaika
That's why git is using checksums as commit labels. It forces consistency
checks at every step.
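
As a small illustration, assuming git's default SHA-1 object format: an object's ID is the hash of a short header plus its content, so any flipped bit changes the ID. The sketch below does this for a file blob; commit IDs are computed the same way over the commit object. `git_blob_id` is a name made up here, not a git API.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the ID git assigns to a blob with this content (SHA-1 format)."""
    header = b"blob %d\0" % len(content)  # object type, size, NUL separator
    return hashlib.sha1(header + content).hexdigest()

if __name__ == "__main__":
    # The empty blob has the well-known ID e69de29b...
    print(git_blob_id(b""))
    # Changing a single character yields an unrelated ID.
    print(git_blob_id(b"state=1\n"))
    print(git_blob_id(b"state=0\n"))
```

The first value should match what `git hash-object` reports for an empty file.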

------
pjf
Is it only me reading the message as "we don't actually know the reason, but
in an act of desperation we rebooted all of our servers and hopefully this
won't happen again"?

~~~
kaens
FTA: "We use MD5 checksums throughout the system, for example, to prevent,
detect, and recover from corruption that can occur during receipt, storage,
and retrieval of customers' objects. However, we didn't have the same
protection in place to detect whether this particular internal state
information had been corrupted. As a result, when the corruption occurred, we
didn't detect it and it spread throughout the system causing the symptoms
described above."

------
drawkbox
Someone checked in an infinite recursive loop.

------
tdavis
_Loose lips sink ships!_

(sorry, I had to...)

