

The network is reliable - aphyr
http://aphyr.com/posts/288-the-network-is-reliable

======
Dave_Rosenthal
Disclaimer: I'm co-founder of FoundationDB.

This is a great blog post. It shows all of the weird faults that motivate the
CAP theorem's relevance to real-world distributed systems. I especially like
the part about the NIC "dropping inbound but not outbound packets". Good
times!

I want to highlight an incorrect conceptual jump that I think some people
will make when reading this article. That jump is: "Because network
partitions occur in the real world, we should choose CAP availability."
In fact, what we actually need is systems that deal gracefully with
partitions, regardless of whether they are CP or AP. An AP system needs to
"reconverge" in a timely way, avoid long latencies, etc. A CP system needs to
avoid split-brain scenarios, remain up despite some nodes being partitioned
away, etc. Mostly, either type of system needs relentless testing that seems
to be dangerously uncommon.

~~~
ryanobjc
I'd also like to point out that the CAP proof doesn't really talk about "split
brain mode"... it talks about "partitions", which are narrowly defined in the
network model.

So even if one is "P"-protected in the "CAP" theorem sense, you might not be
able to handle a full split brain partition.

~~~
aphyr
_So even if one is "P"-protected in the "CAP" theorem sense, you might not be
able to handle a full split brain partition._

I don't think so. Split-brain (assuming you're talking about totally isolated
network components, and not "disagreement about authoritative state between
nodes") is a special case of a partition: any partition-tolerant system is
tolerant of totally isolating partitions by definition. It might do that
either by entering split-brain and becoming non-linearizable, or by refusing
requests.

~~~
ryanobjc
"by refusing requests"

So any system that can handle a split brain by crash-failing successfully
handles the situation?

The definitions here are so broad, and they don't really match my experience,
which is that detecting a fully split brain is difficult for humans to do,
harder for systems, and generally causes major problems.

Over-provisioning your system is one solution (e.g. Dynamo with N=5), but that
isn't acceptable to everyone...

~~~
aphyr
_So any system that can handle a split brain by crash-failing successfully
handles the situation?_

We seem to disagree about what "availability" means in the context of CAP.
Nothing has to crash. All that a serializable system has to do under partition
is refuse to respond to at least one possible request. Exactly which requests
is algorithm-dependent.

For example, a system with e3PC quorum reads and writes will handle partitions
by refusing to allow reads or writes in any minority component. If it writes
to all nodes and can read from any node, it will refuse to accept any writes
during partition, but all nodes can service read requests consistently.
Conversely, if it writes to any node and reads from all, it can _always_
handle write requests, but reads will fail.
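
To make that concrete, here's a minimal sketch of the R/W arithmetic (invented
helper names, not any real database's API): with N replicas, writes needing W
acks, and reads needing R responses, a component cut off by a partition can
serve an operation only if it can still reach enough replicas.

    # Hypothetical model of R/W quorum availability under partition:
    # N replicas, a write needs acks from W of them, a read needs R.
    N = 5

    def can_write(reachable, w):
        # A component accepts writes iff it can still reach W replicas.
        return reachable >= w

    def can_read(reachable, r):
        # A component serves reads iff it can still reach R replicas.
        return reachable >= r

    configs = {
        "write-all/read-one (W=5, R=1)": (5, 1),
        "write-one/read-all (W=1, R=5)": (1, 5),
    }

    # A partition splits the 5 nodes into a 3-node and a 2-node component.
    for name, (w, r) in configs.items():
        for reachable in (3, 2):
            print(name, f"{reachable}/{N} reachable:",
                  "writes ok" if can_write(reachable, w) else "writes refused",
                  "reads ok" if can_read(reachable, r) else "reads refused")

Overlap (R + W > N) is what makes reads consistent; which side of the trade
you give up during a partition is the algorithm-dependent part.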

Pragmatically, the desire for single-node-failure tolerance leads most CP
systems to accept operations only when a majority of nodes are accessible.
Systems like Paxos, Viewstamped Replication, ZAB, and Raft provide CP
semantics in exactly this way: a partition which divides a 5-node cluster into
a 3-node component and a 2-node component will continue to work correctly in
the 3-node component, and the 2-node component will refuse requests until the
partition is resolved.
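
A sketch of that rule (again, purely illustrative): the majority threshold is
what guarantees at most one component keeps operating, because two disjoint
components can never both hold a strict majority.

    # Majority rule used by Paxos/ZAB/Raft-style systems (illustrative).
    def majority(cluster_size):
        return cluster_size // 2 + 1

    def component_available(component_size, cluster_size):
        return component_size >= majority(cluster_size)

    assert component_available(3, 5)      # 3-node side keeps working
    assert not component_available(2, 5)  # 2-node side refuses requests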

Not all systems will shut down a component completely. Most sharded CP
databases will run, say, Paxos quorums between groups of three or five
replicas, but overlap those replicas around the hash ring. This is the design
described by Eric Brewer for Stasis, and is the planned design for Cassandra's
CAS and Riak's CP write support. These systems offer serializability over any
particular key (or shard, e.g. a single quorum), but not _between_ keys.
During failure, some keys will be consistently accessible in one component,
and a disjoint set of keys will be accessible in another.
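
A toy sketch of that layout (hypothetical placement function, not Cassandra's
or Riak's actual code): each key replicates to three consecutive ring
positions, and during a 3/2 partition each key stays available only in the
component holding a majority of its particular replica group.

    from hashlib import sha1

    NODES = ["n0", "n1", "n2", "n3", "n4"]
    RF = 3  # replication factor: one quorum group per key

    def replicas(key):
        # Place each key on RF consecutive ring positions.
        start = int(sha1(key.encode()).hexdigest(), 16) % len(NODES)
        return [NODES[(start + i) % len(NODES)] for i in range(RF)]

    def available_in(key, component):
        # A key's quorum survives where a majority of its replicas live.
        alive = sum(1 for n in replicas(key) if n in component)
        return alive >= RF // 2 + 1

    # Partition: {n0, n1, n2} on one side, {n3, n4} on the other.
    side_a, side_b = {"n0", "n1", "n2"}, {"n3", "n4"}
    for key in ("users:1", "users:2", "users:3", "users:4"):
        print(key, replicas(key),
              "available in A" if available_in(key, side_a) else "",
              "available in B" if available_in(key, side_b) else "")

No key is available on both sides (disjoint components can't both hold two of
its three replicas), but different keys land in different components, which is
exactly the "disjoint sets of keys" behavior described above.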

_The definitions here are so broad, and they don't really match my experience,
which is that detecting a fully split brain is difficult for humans to do,
harder for systems, and generally causes major problems._

You are correct--designing CP algorithms is difficult. Luckily, you don't have
to, because we have several already. You might start with

[https://bitbucket.org/cometpeak/ringmaster/src/1e9fc082d1e5/...](https://bitbucket.org/cometpeak/ringmaster/src/1e9fc082d1e5/doc/lit/zab.pdf)

[http://www.cs.utexas.edu/users/lorenzo/corsi/cs380d/past/03F...](http://www.cs.utexas.edu/users/lorenzo/corsi/cs380d/past/03F/notes/paxos-simple.pdf)

<http://pmg.csail.mit.edu/papers/vr-revisited.pdf>

[https://ramcloud.stanford.edu/wiki/download/attachments/1137...](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf)

------
IgorPartola
I have seen some really interesting network partitions between SoftLayer data
centers. The scenarios I saw were actually very frustrating because these
weren't complete partitions (in the CAP theorem sense), but just sporadic,
really high latency between data centers. The CAP theorem addresses the
situation where there is 100% packet loss both ways, but you can get
"decreased availability" and "decreased consistency" with high latency. In
other words, if you plan on traversing the WAN with your system, your system
should take into account the high-latency situation as well as the classical
split brain.
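
To illustrate (hypothetical names, nothing SoftLayer-specific): from the
client's side, a sporadic multi-second latency spike and a severed link are
indistinguishable until a timeout fires, so the timeout is the point where
"slow" becomes a partition your code has to handle.

    import socket

    def fetch_with_deadline(host, port, payload, timeout_s=0.5):
        # Returns the reply, or None if the peer looks partitioned.
        # A 2-second WAN latency spike and a dropped link both surface
        # here as the same timeout.
        try:
            with socket.create_connection((host, port), timeout=timeout_s) as s:
                s.settimeout(timeout_s)
                s.sendall(payload)
                return s.recv(4096)
        except OSError:  # includes socket.timeout
            return None  # possible partition: retry, fail over, or degrade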

P.S.: at some point we stopped having these issues and overall I think
SoftLayer is a great provider, at least for their bare metal offerings.

------
bonzoesc
This is a pretty exhaustive list of network issues and the software/data
problems they cause. Thanks!

------
ryanobjc
Interesting story, the Amazon datacenters were built in such a way they were
reliant on WAN links between them to act as one.

So take your average datacenter, let's say it has 60,000 machines, and now you
split it into 3 datacenters of 20,000 machines each. For reliability reasons.
(Never mind the math or common sense.)

This is what Amazon did. And the single fiber link between the datacenters had
a tendency to get dug up by pretty much everyone. During these outages, each
DC still had working external connectivity.

And now you know why Dynamo was built the way it was. Full split-brain mode
was common because service owners would split their fleet 50/50 across 2
datacenters. Usually EXACTLY 50%, too, since humans tend to "snap to" even
numbers.

One last point: in the CAP theorem proof, "partitions" really just means loss
of packets to/from one node. That is the smallest, most general case.

Even if one can handle single-node failure well (e.g. Dynamo systems), it may
or may not be able to handle full split-brain mode. That is a lot harder, and
there isn't any real CS to help you here (not that the CAP theorem is
helpful).

~~~
aphyr
_One last point: in the CAP theorem proof, "partitions" really just means
loss of packets to/from one node. That is the smallest, most general case._

I don't know where you got this idea from. The formal model in Gilbert and
Lynch's proof is arbitrary packet loss, which is equivalent to an arbitrary
pattern of isolation between groups (of any size) of nodes. See section 2.3:

[http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pd...](http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf)

_Even if one can handle single-node failure well (e.g. Dynamo systems), it
may or may not be able to handle full split-brain mode._

Just so we're clear, Dynamo does handle totally isolated components during
partition. This is an explicit goal of the paper, and it provides specific
guarantees about availability in those circumstances.

_That is a lot harder, and there isn't any real CS to help you here (not
that the CAP theorem is helpful)._

There is a great deal of CS to help you deal with partitions. You might start
with Lamport, Lynch, Gray, Brewer, Chandra, Cheriton & Skeen, etc...

------
fragmede
While the article has collected an impressive list of public failures and
post-mortems, I wouldn't exactly call them common. It also closes with the
assertion that "some networks really are reliable. Engineers at major
financial firms report that despite putting serious effort into designing
systems that gracefully tolerate partitions, their networks rarely, if ever,
exhibit partition behavior. Cautious engineering (and lots of money) can
prevent outages."

What does lots of money actually get you in this case? The more esoteric the
hardware, the _less_ real-world testing I would expect it to have gone
through. Something like Solaris on SPARC might avoid unreliable NIC/driver
combinations, but that's only one of the failure modes listed.

Having a completely identical set of hardware purely for testing would be
nice, which is where lots of money would help, but some of these failures are
so arcane that they sound like they'd be hard to replicate on duplicate
hardware, never mind testing for them in the first place.

~~~
aphyr
_While the article has collected an impressive list of public failures and
post-mortems, I wouldn't exactly call them common._

Remember, risk is a combination of frequency and severity. A low probability
of catastrophic failure can be worth addressing: look at car crashes, for
example. Moreover, some of these uncommon events have widespread consequences
--remember the EBS/RDS outage? That partition took out hundreds of companies
for several hours.
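
A back-of-the-envelope version of that point (all numbers invented for
illustration): frequent minor blips and a rare catastrophic outage can carry
the same order of expected impact.

    # Expected impact in minutes per year; every figure here is made up.
    frequent_minor = 52 * 5        # weekly 5-minute blips      -> 260
    rare_severe    = 0.5 * 6 * 60  # 6-hour outage every 2 years -> 180
    print(frequent_minor, rare_severe)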

Part of the problem is that because networks and applications are very
different from one another, and because we tend to _fix_ known problems in
systems, it's rare for us to see the same failure twice. That's part of why
these failures trend towards the esoteric side.

 _Having a completely identical set of hardware purely for testing would be
nice, which is where lots of money would help, but some of these failures are
so arcane that they sound like they'd be hard to replicate on duplicate
hardware, never mind testing for them in the first place._

It varies. Amazon and Google are tightly constrained by computational costs:
efficiency matters. Merrill Lynch, meanwhile, has some 27 billion dollars in
annual revenue--and requires nowhere near Amazon's computational capacity to
realize that income. They can afford to buy expensive hardware, to rack it in
tightly controlled datacenters, and to, should they desire, rehearse network
transitions on isolated redundant hardware. They have more predictable load
and slower-changing application goals than most startups. They also have fixed
operating hours for many services, which helps them meet strict SLAs.

All this comes at the cost of increased complexity, longer developer and
operations engineering lead time, and higher capital costs. However it _is_
possible to design networks with significantly reduced probability of failure,
and a few organizations achieve that goal to varying degrees. We just wanted
to acknowledge it in the post.

~~~
fragmede
Excellent article, btw.

> They can afford to buy expensive hardware,

Simply saying their hardware is more expensive doesn't say much. Amazon and
Google aren't running their datacenters on $50 Gig-E switches they picked up
on sale from Fry's either - the switches in their datacenters are _also_
high-end switches that cost more than a car.

What about Merrill Lynch-grade expensive hardware actually avoids the problems
on this list? Do they have super-redundant systems where you can swap out a
bad motherboard without any downtime? Such hardware exists and is super
expensive, but it _still_ wouldn't have prevented half the issues on the list
in and of itself. (Granted, a single monolithic system that 'never' goes down,
versus an N-machine HA setup, would have avoided the heartbeat-related issues,
since there is no heartbeat to run.)

> to rack it in tightly controlled datacenters, and to, should they desire,

Google and Amazon own their datacenters in their entirety in some locations,
so I'm curious what those tighter controls are/do.

> rehearse network transitions on isolated redundant hardware.

Isolated redundant hardware would help in certain test cases, but that doesn't
test everything - often (and especially in the case of such esoteric corner-
case issues), a live load is the only way to test the system.

> They have more predictable load and slower-changing application goals than
> most startups.

Oh absolutely, but that's not due to money or the engineers, but simply
because their 'product' is strictly defined, whereas a startup may not even
know what its product is, and that product may change over its lifetime.

> They also have fixed operating hours for many services, which helps them
> meet strict SLAs.

Ah, that is certainly an advantage for keeping systems up - scheduled
downtime. Taking the system offline to perform upgrades/other maintenance is
certainly something that does not get enough attention.

> We just wanted to acknowledge it in the post.

Sure. I'm taking umbrage at the phrase "on the other hand, some networks
really _are_ reliable" without solid evidence to back it up. Major financial
firms have very little motivation to do the same sort of public disclosure and
post-mortem that this article is really about, and furthermore the limited
feature set and user base of their 'product' only serve to hamper any
exposure it could have.

Somewhere inside those major financial firms whose networks 'rarely' exhibit
partition behavior is a sysadmin whose stories will never see the light of
day, so we aren't seeing 'how the meat is made', so to speak.

When was the last time you did a search via Google that threw an exception,
versus how much do we hear about their software breaking or their equipment
failing (in that pseudo-open fashion of theirs; who knows how many servers
they actually have now)?

~~~
josh2600
Merrill is buying top-of-the-line Cisco gear. Google is making its own
switches and routers because of costs. Cisco is expensive because the mean
time to failure is so high.

Google and Amazon own their datacenters, but they're playing the bargain
basement game. You won't see racks upon racks of Nexus 7k gear in Google, at
least not to my knowledge (please correct me if I'm wrong).

You're right that live loads are different from test loads (entropy) but
having a separate set of test servers still holds a lot of value.

It's just like aphyr said: when your tolerances are higher but your costs of
delivering said tolerances in aggregate are lower, you can afford to over-
provision. A lot of these failures come from blocked I/O or network capacity
failures that just wouldn't happen in a production finance environment because
of over-provisioning.

I hope this sheds some light on your queries :).

