
Skype brief post-mortem - LiveTheDream
http://www.skype.com/content/skype/intl/en-us/StatusUpdate.html
======
terryjsmith
This seems like a nightmare situation for a P2P company. Your super nodes get
knocked offline and there's no way to force update them until they come back
online, which is sporadic at best (based on my own experience today), which
means either manually updating them or completely re-seeding your network
(which seems to be the route they're going -- though so far it doesn't appear
to be going all that well).

As other people have stated on other articles posted here, no discredit to
them for having some downtime (I'm sure they're working round the clock on
it), but I have to wonder if this could have been avoided with a phased
rollout of the new version to the super nodes.
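
A minimal sketch of what a phased rollout could look like, assuming each node has a stable ID and the operator raises a rollout percentage step by step. The names `rollout_bucket` and `should_update` are made up for illustration; nothing here reflects how Skype actually ships updates.

```python
import hashlib


def rollout_bucket(node_id):
    """Map a node ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(node_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def should_update(node_id, percent):
    """True if this node falls inside the current rollout percentage."""
    return rollout_bucket(node_id) < percent
```

Because the bucket depends only on the node ID, a node that was updated at 10% stays in the updated group at 50%, so a bad build can be caught while most nodes are still on the old version.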

~~~
dennisgorelik
Why is it a problem to bring a supernode back online?

~~~
barrkel
I can imagine it would self-DDoS from lack of peer supernodes to share the
load.

~~~
dennisgorelik
But at least it would successfully reply to at least some of the requests, right?

And would still keep trying to serve the requests, right?

So if a sufficient number of supernodes are brought back online, the problems
should disappear.

~~~
terryjsmith
When this happens to our websites (when all servers go down), we need to rate
limit and/or shut down traffic at the load balancer level as we bring things
back online, otherwise everything just continues to get swamped and goes right
back down. This would be nearly impossible in a P2P network and coordinating
it between locations would be an even bigger nightmare. I imagine this is why
turning on an entirely new network is a more viable option for them.
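
A rough sketch of the kind of rate limiting I mean, assuming a single choke point such as a load balancer: a token bucket whose refill rate ramps up linearly while the backends warm up. The class name and parameters are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class RampingTokenBucket:
    """Token bucket whose refill rate ramps up linearly during recovery."""

    def __init__(self, start_rate, full_rate, ramp_seconds, now=0.0):
        self.start_rate = start_rate      # requests/second right after restart
        self.full_rate = full_rate        # requests/second once fully warmed up
        self.ramp_seconds = ramp_seconds  # how long the ramp takes
        self.started = now
        self.last = now
        self.tokens = 0.0                 # start empty: admit nothing at t=0

    def rate_at(self, now):
        """Allowed rate at time `now`, interpolated between start and full."""
        frac = min(1.0, max(0.0, (now - self.started) / self.ramp_seconds))
        return self.start_rate + frac * (self.full_rate - self.start_rate)

    def allow(self, now):
        """Return True if one request may pass at time `now`."""
        rate = self.rate_at(now)
        # Refill for the elapsed time; cap the burst at one second's worth.
        self.tokens = min(rate, self.tokens + (now - self.last) * rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

This only works because a website has one place to put the limiter. A P2P network has no such choke point, which is the hard part.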

~~~
dennisgorelik
Why would an overloaded server go down? Shouldn't it simply stop serving incoming
requests if it's overloaded?

~~~
barrkel
Is there a meaningful distinction between a server that doesn't serve
requests, and a server that is down?

~~~
dennisgorelik
There is a meaningful distinction between a server that serves 1% of requests
and a server that is down.

I assume that an overloaded server still serves some requests. But my assumption
could be wrong, so if you have experience with that -- please share your
knowledge.

Another thing that baffles me: if I introduce a new supernode that is sitting on
a new IP address -- why would all the traffic suddenly hit that node?

Shouldn't it be just a gradual increase in requests while more and more Skype
clients discover that new supernode?

~~~
moe
_I assume that overloaded server still serves some requests._

See <http://en.wikipedia.org/wiki/Thundering_herd_problem>

Many types of systems need a warm up period before they can realize their full
performance. In web applications a controlled warm up is often needed to prime
the caches. In P2P applications - which you can't easily "reboot" as a whole -
the restoration of a steady state can be much more complex.

 _Shouldn't it be just gradual increase in requests while more and more Skype
clients discover that new supernode?_

In theory, yes. In practice this seems to be a case of
<http://en.wikipedia.org/wiki/Cascading_failure>

Either the remaining supernodes can't handle the aggregate load alone, or they
are being overwhelmed because the re-connection attempts from clients are not
evenly distributed.

 _Shouldn't it be just gradual increase in requests while more and more Skype
clients discover that new supernode?_

In theory, yes. In practice there's probably a lot of
<http://en.wikipedia.org/wiki/Positive_feedback> and perhaps even
<http://en.wikipedia.org/wiki/Monster_wave> going on in the Skype network
right now.
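
One common way to spread re-connection attempts out is exponential backoff with random jitter on the client side. A minimal sketch, assuming the client counts its failed attempts; `reconnect_delay` and its defaults are invented for illustration, and I have no idea whether the Skype client does anything like this.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Full-jitter backoff: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

Without the random part, every client that dropped at the same moment retries at the same moment, and the synchronized waves keep knocking over whatever just came back up.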

~~~
jemfinch
This has absolutely nothing to do with the thundering herd problem. Why are
you linking to seemingly random Wikipedia articles?

The GP's question is valid: there's no reason why server software should
_crash_ when overloaded instead of simply degrading service.

~~~
lusis
True, there's no reason but that doesn't mean it doesn't happen. Many people
(I would argue, rightly) equate degraded service to being out of service.

Made-up Scenario: My cluster can handle XXXk users with an SLA of YYms
response time. In degraded mode, I'm only handling XXk users with YYYYms
response times.

I'm not meeting my SLA for the remaining number of users so I am, in essence,
offline.

As to your specific point of "crashing", look at what happened with 37s.
Should the server have crashed? No but there was a bug. The reason you add
more capacity in the FIRST place is because the existing number of nodes
cannot handle the volume. Depending on any number of bugs, issues or
configuration your degraded capacity is for all intents and purposes
"crashed".

Made-up scenario #2: A single server in your apache configuration can handle
200k concurrent connections reliably with fast response times. Double that
load and response times are so long that various devices on the path are
timing out the connections as stale. Apache hasn't crashed but it's not really
doing anything.

Fast failure is an accepted best practice. Shit, it's baked into Erlang. Kill
the process, start a new one and move on. Depending on the nature of the
crash, you're doing nothing but churning processes and not actually servicing
requests.

The bigger problem is that people don't design for this type of scenario.
Static landing pages. Decoupled services instead of monolithic all-in-one
containers. Look at github. That's an awesome example of how to degrade
service during an outage. Only certain components are "crashed" because
everything is fairly decoupled.

Meanwhile there's a guy over here running 4 apps in the same tomcat container
that communicate over memory transport with each other. Or, even if he had the
common sense to decouple each app, he didn't bother to fail fast and is so busy
spinning up threads trying to communicate with the rest of the services that
he can't actually respond to anything externally.
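
To make "fail fast" concrete, here is a bare-bones circuit breaker sketch, assuming calls to a dependency either return or raise. The names `CircuitBreaker` and `CircuitOpenError` and the thresholds are made up for the example; time is passed in explicitly to keep it deterministic.

```python
class CircuitOpenError(Exception):
    """Raised immediately while the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # time the breaker tripped, or None if closed

    def call(self, func, now):
        if self.opened_at is not None:
            if now - self.opened_at < self.reset_after:
                # Fail fast: do not touch the dependency at all.
                raise CircuitOpenError("dependency marked down")
            # Cooldown elapsed: allow one trial call; one failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now
            raise
        self.failures = 0
        return result
```

The point is that once the dependency is known to be down, callers get an error right away instead of piling up threads waiting on it.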

~~~
jemfinch
You're missing a fundamental difference between degraded service and being out
of service: if an overloaded server handles its maximum number of clients but
doesn't crash, then you have time to bring up additional servers to share the
load. If an overloaded server _crashes_ instead of limiting the number of
clients it serves, you have to staunch the flow of clients further upstream
before you can bring more servers up to share the load. This is what the GGGP
is saying, and it seems like everyone (including you) is missing his point.

Degrading service, or more accurately, serving the maximum number of clients
you can serve _and no more_ is absolutely not the same thing as being down.

Failing/crashing rather than limiting the number of clients you serve is
absolutely not established best practice. It's poor practice, in fact.
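
A small sketch of "serve the maximum and no more", assuming a threaded server with a fixed client cap. `AdmissionGate` and `handle` are hypothetical names, and the 503-style reply is only one way to turn clients away.

```python
import threading


class AdmissionGate:
    """Admits at most `max_clients` concurrent requests; the rest are refused."""

    def __init__(self, max_clients):
        self._slots = threading.BoundedSemaphore(max_clients)

    def try_admit(self):
        # Non-blocking: return False immediately when the server is full.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


def handle(gate, request, work):
    """Run `work(request)` if there is capacity, otherwise reject cheaply."""
    if not gate.try_admit():
        return (503, "busy, retry later")
    try:
        return (200, work(request))
    finally:
        gate.release()
```

The rejected clients cost almost nothing, so the server keeps serving the ones it admitted while more capacity is brought up.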

~~~
lusis
I'm approaching it from the business perspective. If support staff are getting
calls for poor performance then the product isn't working. End of story.
Business units don't care about shades of gray and degraded service. It's
binary for them. It's either working or it isn't. Couple that with financial
penalties for latency and degraded service means even less. Either you're
meeting your SLA or you aren't.

------
ddlatham
_Under normal circumstances, there are a large number of supernodes available.
Unfortunately, today, many of them were taken offline by a problem affecting
some versions of Skype._

So, what was the problem affecting versions of Skype?

~~~
AgentConundrum
That's what I've been wondering all day. I didn't have a problem with Skype
being down; I had a problem with Skype having exploded this morning, screaming
about uncaught exceptions as it did.

Your quote was an "aha" moment for me.

~~~
igravious
So did I. Were you travelling at the time? I was in an airport and I figured
that it was the weird kind of problem where you have an internet connection but
it only takes you to a pay-per-usage landing page and no other page that made
Skype barf. It happened to a Java network client too at the same time. I can
send you logs, Skype!

~~~
AgentConundrum
Nope, not traveling. I was just sitting at home and it exploded.

------
swolchok
This isn't really a post-mortem; they're still in progress on a fix.

~~~
tommi
The title also refers to the whole of Skype being dead.

------
johndbritton
"Our engineers are creating new ‘mega-supernodes’ as fast as they can"

What a brilliant solution.

~~~
tommi
It's not a solution, it's a hot-fix.

~~~
dennisgorelik
It's a hotfix that does not work [yet].

------
bystanderr
Reading the discussion here reminded me of the outage of AT&T's long-distance
network back in 1991. After a bit of searching, I found an interesting-looking
document at <http://faqs.org/rfcs/rfc3439.html> entitled "RFC 3439 - Some
Internet Architectural Guidelines and Philosophy," which seems relevant here
somehow. Though this document outlines considerations pertaining to complexity
in Internet backbone architecture, perhaps the overall philosophical questions
it poses and guidelines it offers could be instructive as regards the recent
Skype meltdown.

As regards the aforementioned 1991 failure of the AT&T long-distance system,
which resulted in a service outage of about six hours, the document says:

"The PSTN's SS7 control network provides an interesting example of what can go
wrong with a tightly coupled complex system. Outages such as the well
publicized 1991 outage of AT&T's SS7 demonstrates the phenomenon: the outage
was caused by software bugs in the switches' crash recovery code. In this
case, one switch crashed due to a hardware glitch. When this switch came back
up, it (plus a reasonably probable timing event) caused its neighbors to crash.
When the neighboring switches came back up, they caused their neighbors to
crash, and so on [NEUMANN] (the root cause turned out to be a misplaced
'break' statement; this is an excellent example of cross-layer coupling).
This phenomenon is similar to the phase-locking of weakly coupled oscillators,
in which random variations in sequence times plays an important role in system
stability [THOMPSON]."

------
gojomo
But _why_ did a "large number of supernodes" today get "taken offline by a
problem affecting some versions of Skype"?

Synchronized bug in that software – some sort of clock overflow, or update
from Skype gone awry?

Or a flaw in that software discovered and exploited by others?

There must be a lot more to this story.

------
viraptor
This part is pretty cool, but quite subtle: "Earlier today, we noticed that
the number of people online on Skype was falling". I mean that they noticed
the number going down, not "people started raising tickets saying it doesn't
work".

------
car
How is this a post-mortem? The blog entry does not explain in any way what
really happened.

------
zinssmeister
Does anyone know why Skype picked German as the only translation of this
message?

~~~
mike4u2
Because Germans tend to complain the most if free stuff does not work ;-)

~~~
RP_Joe
Very funny.

------
anonymous246
Does anybody know if these supernodes are paid for by Skype, or whether they're
simply users' computers being commandeered by Skype (after getting click-through
permission)?

The explanation doesn't ring true to me because AFAIK supernodes are used only
if a direct P2P path cannot be established between caller and callee.

What we're observing is the inability to sign in and see contacts' status. I
always thought that was centralized. Are supernodes involved in signin?

~~~
oiuytriuytr
In theory they can be regular users' computers - but unless you have the only
PC connected to some small outpost on the internet it's unlikely that it would
be you.

Generally it's either a Skype- or ISP-owned machine.

