The Network is Reliable (acm.org)
57 points by r4um on Aug 11, 2014 | 15 comments


The network is not reliable, but the cost of manually fixing problems arising from infrequent kinds of instability is usually less than the cost of pre-emptively addressing the issue.

As a practical example, our preferred HA solution for MySQL replication has effectively no network partition safety - if a network becomes partitioned, we'll end up with split brain. However, we have not once had to deal with this specific problem in our years of operation on hundreds of servers.
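
A minimal sketch, not the tooling described above, of the kind of check that catches this condition after the fact (hostnames and credentials are hypothetical): if more than one node in an HA pair reports itself writable, a partition has probably produced two primaries.

    // Split-brain check for an HA MySQL pair (illustrative hosts/credentials).
    // Requires MySQL Connector/J on the classpath.
    import java.sql.DriverManager

    object SplitBrainCheck extends App {
      val hosts = Seq("db1.internal", "db2.internal") // hypothetical HA pair

      // A node is "writable" if read_only is not set, i.e. it will accept
      // writes and is therefore acting as a primary.
      def isWritable(host: String): Boolean = {
        val conn = DriverManager.getConnection(
          s"jdbc:mysql://$host:3306/?connectTimeout=5000", "monitor", "secret")
        try {
          val rs = conn.createStatement().executeQuery("SELECT @@global.read_only")
          rs.next()
          rs.getInt(1) == 0
        } finally conn.close()
      }

      val writable = hosts.filter(isWritable)
      if (writable.size > 1)
        println(s"possible split brain, writable nodes: ${writable.mkString(", ")}")
    }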

That said, do make the assumption that your AWS instances will be unable to reach each other for 10+ seconds on a frequent basis. Your life will be happier if you've already planned for that.
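
A minimal sketch of what planning for that can look like, assuming a plain HTTP call between instances: bound every remote call with a timeout and retry with exponential backoff, so a 10+ second blip degrades into slowness instead of an outage.

    // Generic retry-with-backoff wrapper (illustrative values, not a prescription).
    import scala.util.{Failure, Success, Try}

    object Retry {
      def withBackoff[A](attempts: Int, delayMs: Long = 500)(op: => A): A =
        Try(op) match {
          case Success(value) => value
          case Failure(_) if attempts > 1 =>
            Thread.sleep(delayMs)                                         // wait out the blip
            withBackoff(attempts - 1, math.min(delayMs * 2, 10000))(op)   // cap backoff at 10s
          case Failure(error) => throw error                              // out of attempts
        }
    }

    // Usage against a hypothetical internal endpoint:
    // val body = Retry.withBackoff(attempts = 5) {
    //   scala.io.Source.fromURL("http://10.0.1.23/internal/health").mkString
    // }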


> That said, do make the assumption that your AWS instances will be unable to reach each other for 10+ seconds on a frequent basis. Your life will be happier if you've already planned for that.

This was the biggest shock for me when I first moved an Akka cluster into production on EC2. Running with just the default settings, we routinely saw our Akka nodes marked as unreachable by the rest of the cluster due to EC2 network noise. We wound up pushing our launch back in order to fix the issue because we just couldn't stay online in a predictable fashion. (The issue was compounded by some other problems, though; it wasn't all AWS's fault.)
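
For anyone hitting the same thing: the relevant knobs are Akka's cluster failure detector settings. A minimal sketch of loosening them, with illustrative values rather than a recommendation, and assuming the rest of the cluster configuration (actor provider, seed nodes) already lives in application.conf:

    import akka.actor.ActorSystem
    import com.typesafe.config.ConfigFactory

    object TunedCluster extends App {
      // Tolerate noisier links before marking a node unreachable.
      val tuning = ConfigFactory.parseString(
        """
        akka.cluster.failure-detector {
          threshold = 12.0                 # default 8.0
          acceptable-heartbeat-pause = 10s # default 3s
        }
        """)

      // Merge the overrides with the usual application.conf / reference.conf chain.
      val system = ActorSystem("cluster", tuning.withFallback(ConfigFactory.load()))
    }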


I think there's an important truth to learn from your comment: economics actually guides better approaches to development and design, and helps to avoid premature or unnecessary optimization. There are probably many great solutions that could be invented, but unless you start thinking about their economics, you're not working on making them truly remarkable.


Yes, it can be a good guide, but don't forget that economics guided us to network address translation instead of IPv6. Maybe it just isn't done guiding.


Economics' guidance, like evolution by mutation and natural selection, is a random walk with a lot of local optima.


Isn't NAT anti-economics? Sure, it helps reuse a limited set of IP addresses, but it also introduces unnecessary layers, which doesn't seem cost-effective beyond a certain scale.


Takeaways:

* Network partition tolerance can be designed around, assuming infinite time and money

* Network partition tolerance depends on the application

* Mitigating potential failure requires having a very long view on very fine details

* Most organizations will not be able to engineer solutions to address all network partition-related outages


Sounds like the title of the paper should be "The Network is Unreliable, but it Usually Doesn't Matter."


I think it does matter, but it's good to be realistic about it.

If the availability of your application over the internet determines your revenue flow, you probably do want to try your best to make it as reliable as practical. Creating a mesh network to serve a large website is overkill, cost-wise. So you do what you can, and then you implement things like continuous integration, monitoring, trending, alerting and escalation procedures to be ready for the eventual failure.

In my experience, relying on a service provider designed to increase reliability (let's say Akamai) will smooth out the great majority of problems a live website might have with availability. One really reliable datacenter will keep the big problems at bay, leaving you to get good at iterating over common minor issues like maintenance and local performance problems.


Great article. A lot of engineers don't have personal experience with these kinds of network failures, so sharing stories of their consequences means more engineers can make informed (and conscious) decisions about how much risk can be tolerated for their applications.

One thing that you could glean from this article (and I think that this is incorrect) is that the application or operations engineer is responsible for understanding the nuances of distributed systems. In my experience, the number of people who rely on distributed systems is much larger than the number of people who understand these issues.

So what we really need are systems we can build on whose developers understand how to build (and test!) the nuances of data convergence, consensus algorithms, split-brain avoidance, etc. We need systems that gracefully, and automatically, deal with and recover from network failures.

Full disclosure: I'm an engineer at FoundationDB


I feel like the authors (or someone else) could do a lot more justice to their overall objective (i.e. teasing out patterns) by applying some kind of qualitative content analysis to the case studies [0].

[0] http://www.qualitative-research.net/index.php/fqs/article/vi...


There was some discussion of a preliminary version of this article/blog post [0] last year: https://news.ycombinator.com/item?id=5820245

[0] http://aphyr.com/posts/288-the-network-is-reliable


Related reading on data structures that make availability easier to maintain under network partition: http://writings.quilt.org/2014/05/12/distributed-systems-and...
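
A toy sketch of the kind of structure that article covers, a state-based grow-only counter (G-counter) CRDT: each node increments its own slot, and replicas merge by taking per-node maximums, so writes accepted on both sides of a partition converge once the replicas can talk again.

    // Grow-only counter CRDT (illustrative, not taken from the linked article).
    final case class GCounter(counts: Map[String, Long] = Map.empty) {
      // Each node only ever bumps its own entry.
      def increment(node: String, by: Long = 1): GCounter =
        copy(counts = counts.updated(node, counts.getOrElse(node, 0L) + by))

      def value: Long = counts.values.sum

      // Merge is commutative, associative, and idempotent: take the max per node.
      def merge(other: GCounter): GCounter =
        GCounter((counts.keySet ++ other.counts.keySet).map { k =>
          k -> math.max(counts.getOrElse(k, 0L), other.counts.getOrElse(k, 0L))
        }.toMap)
    }

    // val a = GCounter().increment("node-a")    // applied on one side of a partition
    // val b = GCounter().increment("node-b", 2) // applied on the other
    // assert(a.merge(b).value == 3)             // same result regardless of merge order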


The headline states that the network is reliable, but the article then goes on to list lots of cases where the network fails.


The headline quotes a famous list of fallacies, and the article then goes on to list the cases where the original quote was proven right.



