
Why RIM still hasn’t found the cause of its world-wide outage - shawndumas
http://arstechnica.com/business/news/2011/11/why-rim-still-hasnt-found-the-cause-of-its-world-wide-outage.ars
======
westajay
In other industries, such as petroleum refining, pipelines and nuclear power,
there are structured methodologies for determining root causes. Some of these
take into account equipment failure, and modelling is often done on equipment
life cycles to determine replacement and inspection schedules in oil
refineries.

These industries also employ strict management of change processes so that an
ad-hoc decision or improvisation doesn't cause an incident.

What puzzles me is why you don't see these kinds of practices applied in data
center operations.

Are data centers really more complex than, say, a nuclear plant?

~~~
randomdata
_What puzzles me is why you don't see these kinds of practices applied in data
center operations._

Simply, people are not willing to pay the price. All of that extra planning
requires more man hours. It also requires people, who come at significant
cost, skilled at working in such environments. People want to use their
Blackberries for hundreds of dollars up front and tens of dollars each month,
not thousands of dollars monthly.

It _can_ be done, but the market has determined that it does not want it to be
done. It would rather accept some downtime and other problems in order to
access the technology at a lower cost.

~~~
westajay
When I wrote my comment I wasn't thinking of lots of up-front planning. I was
thinking along simpler lines, like root cause analysis using a human
factors or equipment taxonomy (much more effective than 5-whys), and simple
logging of incidents for later analysis.

I think some of these kinds of processes can be adopted with small investments
in training and change.

Also, a lot of these kinds of failures seem to stem from changes at the
networking layer, which should be more planned and tested given their place
in the stack (we're not talking about crazy app behaviour).

------
serverascode
While I feel that saying that massive systems fail in unusual ways is
accurate, I also am concerned that RIM may not have the same technical prowess
as a Google or even Amazon, and yet in a way they are competing at that scale.

Further, I guess pulling their users' email into their system makes business
sense for them, but that has always struck me as an unusual technical choice.

~~~
randomdata
_Further, I guess pulling their users' email into their system makes business
sense for them, but that has always struck me as an unusual technical choice._

I always thought it was so they could provide mail using a protocol that was
optimized for battery life. Constantly polling a mail provider for updates is
not exactly a good use of power.

They couldn't really expect everyone who runs a mail server to install the
necessary infrastructure to support the devices, so the next best thing is to
have computers operated by RIM attached to power mains worry about collecting
the email and then notify the wireless device only when there are changes.
Apple uses the same basic model for their push services.

~~~
wmf
Many customers already have BES installed. Why doesn't the phone talk to the
BES directly? What value is RIM's centralized data center adding?

------
recoiledsnake
The common trend among the outages (at least Gmail and Amazon EC2) seems to be
that the queued transaction messages overwhelm the network, which makes the
attempts at recovery futile.

Maybe a solution is to immediately stop all new requests and have servers that
record all the transactions to disk instead of redirecting them to overloaded
servers.
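
As a rough sketch of that idea (the `Intake` class, its method names and the
one-JSON-object-per-line journal format are all made up for illustration, not
anything Gmail, Amazon or RIM is known to run): forward transactions while
there is capacity, append them to a local file when there is not, and replay
the file once the backlog clears.

```python
import json
import os


class Intake:
    """Hypothetical intake: forward while healthy, journal to disk when overloaded."""

    def __init__(self, journal_path, max_in_flight):
        self.journal_path = journal_path
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.forwarded = []  # stand-in for the real downstream servers

    def submit(self, txn):
        if self.in_flight >= self.max_in_flight:
            # Overloaded: append to disk instead of redirecting to busy servers.
            with open(self.journal_path, "a", encoding="utf-8") as f:
                f.write(json.dumps(txn) + "\n")
            return "journaled"
        self.in_flight += 1
        self.forwarded.append(txn)
        return "forwarded"

    def complete(self):
        # Called when a forwarded transaction finishes downstream.
        self.in_flight = max(0, self.in_flight - 1)

    def replay(self):
        """Resubmit journaled transactions; returns how many were read back."""
        if not os.path.exists(self.journal_path):
            return 0
        with open(self.journal_path, encoding="utf-8") as f:
            pending = [json.loads(line) for line in f if line.strip()]
        os.remove(self.journal_path)
        for txn in pending:
            self.submit(txn)  # goes back to the journal if still overloaded
        return len(pending)
```

This only moves the backlog from the network to local disk; it assumes the
intake machines themselves stay up and have disk to spare.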

~~~
acqq
How can you make people stop sending messages?

How would "recording to disk" change anything? What do you think is being done
now, other than writing to some databases?

~~~
gwright
You have to build this sort of thing into the infrastructure.

For example, TCP and Ethernet have retransmission strategies, and HTTP servers
should have reasonable timeouts that result in error responses rather than
just letting incomplete HTTP sessions pile up.

The same principles can and should be applied to any communication protocol.
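
A minimal sketch of what that could look like, assuming a hypothetical
`BoundedServer` front end (the class, its parameters and the choice of 503 and
504 as the error responses are illustrative, not taken from any particular
server): the backlog has a hard cap, and anything that waits past its deadline
gets an error instead of being worked on late.

```python
import queue
import time


class BoundedServer:
    """Hypothetical front end: bounded backlog plus a per-request deadline."""

    def __init__(self, max_backlog, timeout_s):
        self.backlog = queue.Queue(maxsize=max_backlog)
        self.timeout_s = timeout_s

    def accept(self, request, now=None):
        now = time.monotonic() if now is None else now
        try:
            # Refuse immediately when the backlog is full instead of piling up.
            self.backlog.put_nowait((now + self.timeout_s, request))
        except queue.Full:
            return 503  # error response: the client should back off
        return 202  # accepted for later processing

    def process_one(self, now=None):
        now = time.monotonic() if now is None else now
        try:
            deadline, request = self.backlog.get_nowait()
        except queue.Empty:
            return None
        if now > deadline:
            return (request, 504)  # waited too long: fail fast, skip stale work
        return (request, 200)
```

The `now` argument exists only so the behaviour can be exercised without
sleeping; a real server would rely on the monotonic clock.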

