

CIO update: Post-mortem on the Skype outage - djipko
http://blogs.skype.com/en/2010/12/cio_update.html?cm_mmc=PXBL|0700_B6-_-downtime-20101229

======
teyc
There was a major cascading failure in the power grid a few years back.

I thought there was a case of an Amazon outage attributed to the same class
of error.

The engineering trade-offs required are: 1) protecting the servers
themselves from being damaged; 2) accepting that when servers go offline to
protect themselves, this may cause other servers to go offline; 3) isolating
a failure to specific subgroups in the network; 4) providing enough excess
capacity to take the load in the event of an outage.

Bugs will occur, no matter how good the engineering is. Clients will need to
be smarter - for example, by implementing some kind of exponential backoff
depending on whether the network is responsive.
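
A rough sketch of what that kind of client-side backoff could look like (the
names and limits are invented for illustration, not Skype's actual client
logic):

    import random
    import time

    def connect_with_backoff(connect, max_retries=10, base_delay=1.0, cap=300.0):
        """Retry `connect` with exponential backoff plus jitter, so that
        millions of clients do not retry in lockstep after an outage.
        `connect` is any callable that raises ConnectionError on failure."""
        for attempt in range(max_retries):
            try:
                return connect()
            except ConnectionError:
                # Double the wait each attempt (up to a cap) and add random
                # jitter so reconnection attempts are spread out over time.
                delay = min(cap, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, delay))
        raise ConnectionError("giving up after %d attempts" % max_retries)

The jitter matters as much as the doubling: without it, every client that
failed at the same moment retries at the same moment too.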

------
comex
A very transparent, friendly attitude... from the same company that does this:
<http://kirils.org/skype/27c3.pdf>

I guess companies can have several faces, but it still strikes me as bizarre.

------
lukev
Very interesting. As with all outages of major services, it seems it started
with a confluence of independently minor, unforeseen events.

One question they didn't answer, though, is whether they're going to address
the core problem: that a positive feedback loop of overloaded supernodes is
possible. It seems to me that a p2p system should be able to recover from
having 20% of its nodes taken offline rather than spiraling into a full
collapse.

Avoiding the scenario where 20% of supernodes go offline in the first place
is of course desirable, but since any number of things could cause that, a
genuinely resilient system should remain functional (if in a degraded
capacity) even if only a small fraction of nodes remains available.
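
As a toy illustration of that feedback loop (made-up numbers, nothing to do
with Skype's real topology): knock out a fraction of supernodes, spread their
load over the survivors, and see whether the overload pushes more of them
over the edge.

    import random

    def simulate_cascade(num_nodes=1000, total_load=1000.0, drop_fraction=0.2):
        """Each node gets a random amount of headroom; when nodes fail,
        their share of the load moves to the survivors, which may push
        more of them past capacity in the next round."""
        capacities = [random.uniform(1.1, 1.6) for _ in range(num_nodes)]
        alive = capacities[int(num_nodes * drop_fraction):]  # initial outage
        rounds = 0
        while alive:
            per_node = total_load / len(alive)
            survivors = [c for c in alive if c >= per_node]
            if len(survivors) == len(alive):
                return rounds, len(alive)   # the network stabilized
            alive = survivors
            rounds += 1
        return rounds, 0                    # full collapse

    print(simulate_cascade(drop_fraction=0.05))  # typically stabilizes
    print(simulate_cascade(drop_fraction=0.2))   # typically collapses

With these numbers a 5% outage is absorbed, while a 20% outage cascades to
zero survivors within a couple of rounds - which is the kind of nonlinearity
I mean.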

~~~
toumhi
That's the desired behavior, although it's much more difficult in practice
than in theory. If 20% of your network goes down and you can still serve
clients normally, it means you have a big reserve of machines useful only in
case of big outages. I don't know if you can justify that economically.

You can also degrade performance gracefully - by rejecting new client
connections, progressively disconnecting some clients, accepting a loss of
consistency, etc. It depends how far you can go without infuriating your
customers.
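
A crude sketch of that kind of admission control / load shedding (all
thresholds are invented for illustration):

    MAX_CONNECTIONS = 400000   # hard ceiling, protect existing sessions
    SOFT_LIMIT = 350000        # start shedding optional load before the ceiling

    def admit(current_connections):
        """Decide what to do with a new client at the current load level.
        Returns "accept", "accept_degraded", or "reject"."""
        if current_connections >= MAX_CONNECTIONS:
            return "reject"            # better to turn clients away than fall over
        if current_connections >= SOFT_LIMIT:
            return "accept_degraded"   # e.g. drop presence updates and avatars
        return "accept"

    print(admit(100000))   # accept
    print(admit(360000))   # accept_degraded
    print(admit(410000))   # reject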

We discovered that large-scale real-time systems (in our case, currently
400,000 concurrent connections) are really hard to stabilize against presence
storms, network problems, and buggy clients, among other things.

~~~
jpablo
_If 20% of your network goes down and you can still serve clients normally,
it means you have a big reserve of machines useful only in case of big
outages. I don't know if you can justify that economically._

Just spin up more EC2 instances?

~~~
toumhi
Yes, if you use an elastic cloud, by all means spin up more instances :-)
Most existing companies still run real servers, however.

~~~
bonzoesc
Most existing companies don't run P2P voice chat networks, either. Using EC2
or some other elastic cloud for emergency supernodes makes a lot of sense,
since they can outsource the risk of those machines sitting idle to Amazon.

------
teoruiz
I wonder how they managed to launch "several thousand more mega-supernodes
through the night".

I assume they can't turn regular Skype nodes into supernodes, because
supernodes must be reachable from a public IP address.

Were they using a cloud computing provider such as EC2?

~~~
JoachimSchipper
They mention using "group video chat servers".

