
The Network is Reliable - r4um
http://queue.acm.org/detail.cfm?id=2655736
======
falcolas
The network is not reliable, but usually the cost of manually fixing problems
arising from infrequent types of instability is less than the cost of pre-
emptively addressing the issue.

As a practical example, our preferred HA solution for MySQL replication has
effectively no network partition safety - if a network becomes partitioned,
we'll end up with split brain. However, we have not once had to deal with this
specific problem in our years of operation on hundreds of servers.

That said, do make the assumption that your AWS instances will be unable to
reach each other for 10+ seconds on a frequent basis. Your life will be
happier if you've already planned for that.
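A minimal sketch of that kind of planning, assuming a hypothetical `call_with_retry` helper (the names and defaults are illustrative, not from any particular library): instead of failing on the first connection error, wait out what may be a transient partition with retries and backoff.

```python
import time

def call_with_retry(fn, retries=5, timeout=2.0, backoff=2.0):
    """Retry a flaky remote call until it succeeds or retries run out.

    Illustrative only: real code would also cap total wait time, add
    jitter, and catch whatever errors its network client actually raises.
    """
    delay = timeout
    for attempt in range(retries):
        try:
            return fn()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # the outage outlasted our patience; surface it
            time.sleep(delay)  # wait out what may be a transient partition
            delay *= backoff   # back off exponentially between attempts
```

With defaults like these, a caller rides out roughly half a minute of unreachability before giving up, which matches the 10+ second assumption above.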

~~~
yomrwhite
I think there's an important truth to learn from your comment: economics
actually guides better approaches to development and design and helps to avoid
premature/unnecessary optimization. There are probably many great solutions
that can be invented, but unless you start thinking about the economics of
them, you're not working on making them truly remarkable.

~~~
0xdeadbeefbabe
Yes, it can be a good guide, but don't forget that economics also guided us to
network address translation instead of IPv6. Maybe it just isn't done guiding.

~~~
mcguire
Economics' guidance, like evolution by mutation and natural selection, is a
random walk with a lot of local optima.

------
peterwwillis
Takeaways:

* Network partition tolerance can be designed around, assuming infinite time and money

* Network partition tolerance depends on the application

* Mitigating potential failure requires having a very long view on very fine details

* Most organizations will not be able to engineer solutions to address all network partition-related outages

~~~
jaybuff
Sounds like the title of the paper should be "The Network is Unreliable, but
it Usually Doesn't Matter."

~~~
peterwwillis
I think it does matter, but it's good to be realistic about it.

If the availability of your application over the internet determines your
revenue flow, you probably do want to try your best to make it as reliable _as
practical_. Creating a mesh network to serve a large website is overkill,
cost-wise. So you do what you can, and then you implement things like
continuous integration, monitoring, trending, alerting and escalation
procedures to be ready for the eventual failure.

In my experience, relying on a service provider designed to increase
reliability (let's say Akamai) will smooth out the great majority of problems
a live website might have with availability. One really reliable datacenter
will keep the big problems at bay, leaving you to get good at iterating over
common minor issues like maintenance and local performance issues.

------
jrullmann
Great article. A lot of engineers don't have personal experience with these
kinds of network failures, so sharing stories of their consequences means more
engineers can make informed (and conscious) decisions about how much risk can
be tolerated for their applications.

One thing that you could glean from this article (and I think that this is
incorrect) is that the application or operations engineer is responsible for
understanding the nuances of distributed systems. In my experience the number
of people who are relying on distributed systems is much larger than the
number of people who understand these issues.

So what we really need are systems we can build on whose developers understand
how to build (and test!) the nuances of data convergence, consensus
algorithms, split-brain avoidance, etc. We need systems that gracefully (and
automatically) deal with and recover from network failures.

Full disclosure: I'm an engineer at FoundationDB
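One concrete instance of the "data convergence" idea is a grow-only counter CRDT. This is a generic sketch of that technique, not a description of how FoundationDB works: each replica tracks per-node tallies, and merging takes an element-wise max, so replicas that diverged during a partition converge once they can talk again.

```python
class GCounter:
    """Grow-only counter CRDT (illustrative sketch).

    Merge is commutative, associative, and idempotent, so replicas
    converge regardless of the order or number of merges.
    """
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # per-node increment tallies

    def increment(self, n=1):
        # Each node only ever increments its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other):
        # Element-wise max: safe to apply in any order, any number of times.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())
```

Two replicas that both accepted increments on opposite sides of a partition end up with the same total after merging in either direction, with no coordination required during the outage.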

------
blutoot
I feel like the authors (or someone else) can do a lot more justice to their
overall objective (i.e. tease out patterns) by applying some kind of a
qualitative content analysis of case studies [0].

[0] [http://www.qualitative-research.net/index.php/fqs/article/vi...](http://www.qualitative-research.net/index.php/fqs/article/view/75/153)

------
blutoot
There was some discussion on a preliminary version of this article/blog-post[0] last year:
[https://news.ycombinator.com/item?id=5820245](https://news.ycombinator.com/item?id=5820245)

[0] [http://aphyr.com/posts/288-the-network-is-reliable](http://aphyr.com/posts/288-the-network-is-reliable)

------
jchrisa
Related reading on data structures that make availability easier to maintain under network partition: [http://writings.quilt.org/2014/05/12/distributed-systems-and...](http://writings.quilt.org/2014/05/12/distributed-systems-and-the-end-of-the-api/)

------
KaiserPro
The headline states that the network is reliable, but the article then goes on
to list lots of cases where the network fails.

~~~
Argorak
The headline quotes a famous list of fallacies, and the article then goes on
to show where the original quote was proven right.

