
Cisco’s Network Bugs Are Front and Center in Bankruptcy Fight - sytse
http://www.bloomberg.com/news/articles/2016-09-08/cisco-s-network-bugs-are-front-and-center-in-bankruptcy-fight
======
ericseppanen
5-nines uptime means anticipating that the components (hardware and software)
you bought from another company may fail in devious, correlated ways.

Failure to do so is simply incompetence.

Maybe the customer didn't read the fine print of their service guarantees, and
that's on them. I would hope that doesn't happen often-- it would be very
silly if service guarantees fell apart every time some piece of shoddy
equipment (purchased and operated by the service provider) turned out to be at
fault.

~~~
txutxu
This.

The highest-SLA system I've built was for a military project.

Physical network layout: I chose a dual-port star topology, that is, every
HP 5300 modular switch was connected to each other switch with two "teamed"
ports.

Normally, if you just wire things up that way, you get a network loop. But
with STP and VRRP on the HP 5300s, I got an "always on" network.

The network expanded to 50+ wifi access points behind a proprietary wifi
controller, which could also gracefully fail over connections from the APs on
network splits.

Servers behind the switches were replicated in each segment, so at any time
you could turn off any rack (switch + servers), either in an orderly shutdown
or by cutting its power suddenly, and the system continued to work
flawlessly. This was 5 identical racks/switches in a star topology.

Acceptance tests included literally cutting cables, literally turning off the
UPS and rack power, etc.
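
A toy model of what those tests asserted (a Python sketch; the service names
are made up): with every service replicated in all five racks, killing any
single rack must leave every service reachable.

    # Every service is replicated in each of the 5 racks.
    racks = {f"rack{i}": {"dns", "ldap", "files"} for i in range(1, 6)}

    def surviving_services(dead_rack):
        """Union of services still reachable after one rack dies."""
        return set().union(*(svcs for name, svcs in racks.items()
                             if name != dead_rack))

    # The acceptance criterion: no single rack loss drops a service.
    for victim in racks:
        assert surviving_services(victim) == {"dns", "ldap", "files"}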

We could upgrade any firmware (the HP 5300 chassis, its hot-swappable
modules, or the servers' BIOS/network-card/hard-disk firmware) without any
service loss.

I'm happy with that result.

I have to say, none of the other projects I've worked on (15+ years) had the
resources for, or gave any importance to, the needs of network firmware
upgrades or downtime (because of a bomb?). In most cases it was not because
of technical issues or handicaps; it was because of management.

Some projects did listen to me, contemplated the issue, and planned for it as
a "maintenance window", or as what today is called "immutable
infrastructure": prepare the new one, stop the service, swap it in, bring the
service back up.

I never upgraded switch/router firmware at $job without having a backup
switch ready and pre-configured in case something went wrong, and I kept that
backup around for a prudent amount of time.

In my "always on" military project, firmware did need to pass acceptance tests
in environments equal to production, before go to any production environment.

Edit: remove duplicated info

~~~
ChuckMcM
I remember explaining to a storage customer why they needed three very
expensive switches at the core of their network. One to handle the traffic,
another to handle the traffic when the first was busy, and a third which was
in the rack, patched and updated to one release _behind_ the main switches,
that could be swapped in by moving cables in under 5 minutes of operator time.

They initially thought I was kidding, but then we did the fail tree on a
whiteboard. It was an interesting experience for them, understanding what it
took to get what they took for granted.

~~~
jerf
You used the term "fail tree" which sounds interesting and I tried to learn
more on the Internet but hit problems. Do you mean fault tree?
[https://en.wikipedia.org/wiki/Fault_tree_analysis](https://en.wikipedia.org/wiki/Fault_tree_analysis)
If not, do you have a better search term and/or link I could follow up on?

~~~
ChuckMcM
Fault trees seem to be reasonably close. While developing my thinking and
understanding on reliability analysis, I came at it from an analog of
decision trees. There were a couple of influential talks that got me started:
one by Sandia labs discussing the ways in which they ensure that nuclear
devices can't be detonated without approval (big requirement) through a
process called vulnerability analysis, and another on how Citibank worked
their network to ensure uptime. The Chaos Monkey series from Netflix was
quite fun as well. Google did something similar internally with its DiRT
exercises (Disaster Recovery).

My "fail tree" (and I'll put it in quotes as unique to my conception of them)
analysis consists of identifying a failure, the system response to the
failure, a time to fix for the failed system, and a guess at the uncertainty
on the fix. So for example "switch hardware failure" is a failure, with a fix
time that varies based on "replacement part on hand" to "order/ship/install
(replace) the entire switch". The first order is failure/fix tree with
callouts of down time. The second order is mitigation/cost with mitigation
strategies and their cost resulting in a new call out of potential downtime,
and the third order is mitigation accelerators and their cost (which shorten
recovery to non-degraded mode) which affect cost and possible down time.

Much of that you can do on paper, but sometimes you will have to run
experiments to see how long things take to fix.
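
If it helps to see that concretely, here is roughly how one node of such a
tree could be encoded (a Python sketch; the costs and times are invented for
illustration):

    from dataclasses import dataclass, field

    @dataclass
    class Mitigation:
        name: str
        cost_usd: float      # what you pay up front for the mitigation
        fix_hours: float     # recovery time once you have it in place

    @dataclass
    class Failure:
        name: str
        fix_hours: float            # baseline time to fix, unmitigated
        uncertainty_hours: float    # guess at the error bars on that fix
        mitigations: list = field(default_factory=list)

        def best_fix_hours(self):
            """Shortest downtime across the mitigations you paid for."""
            return min([self.fix_hours] +
                       [m.fix_hours for m in self.mitigations])

    switch_failure = Failure(
        name="switch hardware failure",
        fix_hours=72.0,             # order/ship/install the entire switch
        uncertainty_hours=48.0,
        mitigations=[
            Mitigation("replacement part on hand", 5_000, 2.0),
            Mitigation("pre-staged spare in the rack", 50_000, 0.1),
        ],
    )
    print(switch_failure.best_fix_hours())   # 0.1 hours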

~~~
jerf
Thank you.

------
wbl
I'm surprised Peak Web isn't responsible for bugs in its suppliers' stuff under
the terms of the contract. After all, it picked them, and can switch to
alternatives if they don't measure up, while the game company cannot.

~~~
mbreese
If they agreed to an uptime of 99.995% (~26 min/year of downtime), then
that's on Peak Web. It sounds like they oversold their capabilities. For that
level of uptime, they should have built much more redundancy into their
engineering.
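
The downtime budget for a given SLA is simple arithmetic (Python):

    minutes_per_year = 365 * 24 * 60          # 525,600
    for sla in (0.99995, 0.99999):
        print(f"{sla} -> {(1 - sla) * minutes_per_year:.1f} min/year")
    # 0.99995 -> 26.3 min/year
    # 0.99999 -> 5.3 min/year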

~~~
loeg
I furiously agree with you. Peak Web failed their SLA in the first month,
before the Cisco issue even cropped up:

> According to Machine Zone, the hosting service couldn’t make it a month
> without an outage lasting almost an hour. Another in August of that year was
> traced to faulty cables and cooling fans, according to the publisher.

------
peterwwillis
_" Three people familiar with Peak Web’s operations say the [10 hour] outage
gave the company time to deduce that the troublesome command was reducing the
switches’ available memory and causing them to crash."_

Cisco Nexus devices log alerts for low memory/resource situations to syslog,
the defaults being 85% minor, 90% severe, and 95% critical. Were they not
reading their logs?
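
Even a trivial watcher on those readings should have paged someone long
before a crash. A sketch of the check (Python; only the 85/90/95 threshold
defaults come from Cisco, the collection side is hand-waved):

    # Default Nexus memory alert thresholds, highest first.
    THRESHOLDS = [(95.0, "critical"), (90.0, "severe"), (85.0, "minor")]

    def memory_alert_level(percent_used):
        """Classify a memory-utilization reading against the defaults."""
        for threshold, level in THRESHOLDS:
            if percent_used >= threshold:
                return level
        return None   # below all thresholds: no alert

    assert memory_alert_level(92.0) == "severe"
    assert memory_alert_level(97.5) == "critical"
    assert memory_alert_level(60.0) is None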

------
jobes
You should be able to build a lot of redundancy into your network on $4
million a month, but that would eat into margins, wouldn't it?

~~~
wmf
In networking, redundancy means two identical boxes with identical bugs.

~~~
X-Istence
There are very few companies that are willing to purchase network gear from
two different vendors and run two completely different network stacks.

Management would be a huge pain in the ass... but it would be doable.

~~~
walrus01
We are a large regional ASN in a tech-heavy market and run three different
vendors' DWDM platforms, plus both Cisco and Juniper in the IP core.
Diversity is good.

------
walrus01
Peak Web are idiots. With sufficient diversity and redundancy it is not that
hard to run a five-nines ISP. A ten-hour outage is amateur hour. There are a
great many terrible low-budget hosting companies and colocation operators in
the market.

~~~
packetized
Five nines is five minutes and sixteen seconds of downtime per year. That is
an absurdly small amount of time relative to an entire year. Saying it's not
hard demonstrates your ignorance of operating a network.
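
That figure is just 1 - 0.99999 of a year (Python):

    seconds_per_year = 365.25 * 24 * 3600
    print((1 - 0.99999) * seconds_per_year)   # ~315.6 s = 5 min 16 s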

~~~
walrus01
You'd be surprised - I work for an ISP that sells a five-nines SLA on most of
its circuits, and meets it. We spend money to do it: 1+1 redundant core and
aggregation router sets at every POP, geographic diversity of inter-metro
fiber, full A- and B-side power/rectifier systems, etc.

With the right BGP and OSPF design you can absolutely meet five-nines
availability from an end-user customer's perspective. We have some places
that are approaching six nines.
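
The back-of-envelope for why 1+1 gets you there (Python; the per-box
availability is an assumed number, and it only holds if the A and B sides
fail independently):

    a_single = 0.999                   # assumed availability of one box
    a_pair = 1 - (1 - a_single) ** 2   # 1+1: both must fail at once
    print(a_pair)                      # 0.999999 -> six nines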

This article reads like they had 1+0 everything and ran into a nasty IOS bug.
Running out of RAM is amateur hour as well.

When we need to do service-impacting maintenance that will take a customer's
segment entirely off the net, the hit can be from 15 seconds to a couple of
minutes. And that is in a case where a colo customer is single-homed to a
single aggregation switch, like one of our 48-port 10GbE Aristas.

~~~
mjevans
Would you also encourage such a network to be built with a diversity of parts
suppliers? I ask because it seems like you'd also want to be able to resist
attackers exploiting a bug in a single-vendor solution.

~~~
walrus01
There are a great many ISPs successfully using a mix of Juniper, Cisco,
Arista and other stuff. You will still find Extreme and Foundry 1000BASE-T
switches all over the place. I would never encourage a monoculture of one
model of Nexus 3000 or anything similar to that.

~~~
pinewurst
There are a lot of CCIE types out there who feel they have to maintain a
totally Cisco shop regardless of the equipment or diversity merits. This is
finally beginning to fade, but I've seen it more than a few times.

------
X-Istence
MachineZone shouldn't have put all of their eggs in a single basket: use
multiple hosting providers, and write your server side so that failover to
other datacenters/locations can happen almost instantly.

~~~
YesThat
Exactly this. The idea that MachineZone was willing to take a business that
they advertised during the Super Bowl
([https://www.youtube.com/watch?v=XkaWyrm8EQg](https://www.youtube.com/watch?v=XkaWyrm8EQg))
and put it all in the hands of a single ISP at a single facility is
questionable at best.

------
PaulHoule
My experience has been that Cisco hardware from the high end to the low end is
crap.

------
imglorp
> The entire network often has to go down in order to patch—very disruptive in
> the best of times

This is why Erlang has had hot code loading since around 1986. Those who xxx
are doomed to repeat yyy....

~~~
tlb
Hot-loading code works in many languages. But it's not practical to hot-swap
hardware driver code, especially in order to fix a bug that initializes the
hardware wrong.

Also, the bug in question caused memory exhaustion, which isn't necessarily
easy to fix by hot-swapping code without some kind of restart.
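
For comparison, a minimal Python version of hot loading (the handlers module
is hypothetical) shows the limits: reloading re-executes the module's code,
but it won't reclaim memory that old objects still hold, and it can't re-run
one-time hardware initialization:

    import importlib
    import handlers   # hypothetical module we want to patch in place

    def reload_handlers():
        importlib.reload(handlers)      # re-executes the module body
        return handlers.handle_request  # callers pick up the new code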

------
wmccullough
Apparently, it's five-nines week on Hacker News.

