
Solving the Mystery of Link Imbalance: A Metastable Failure State at Scale - danso
https://code.facebook.com/posts/1499322996995183/solving-the-mystery-of-link-imbalance-a-metastable-failure-state-at-scale/
======
cperciva
_The most literal conclusion to draw from this story is that MRU connection
pools shouldn’t be used for connections that traverse aggregated links._

Not just connections which traverse aggregated links. If you're load-balancing
between multiple database replicas -- or between several S3 endpoints -- this
sort of MRU connection pool will cause metastable load imbalances on the
targets.

What I don't understand is why Facebook didn't simply fix their MRU pool:
Switching from "most recent _response received_ " to "most recent _request
sent_ " (out of the links which don't have a request already in progress, of
course) would have flipped the effect from preferring overloaded links to
avoiding them.
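
To make the difference concrete, here's a toy simulation (my own sketch,
not Facebook's code; the two-link pool with one connection per link, the
latencies, and the arrival rate are all invented for illustration). The
slow link is 10x slower and starts "on top" of the pool, as it would
after a burst drains the pool and its response comes back last:

    import random

    class Conn:
        def __init__(self, name, latency, bias):
            self.name = name
            self.latency = latency        # fixed service time, in ms
            self.busy_until = 0.0
            self.last_send = bias         # when the last request went out
            self.last_response = bias     # when the last response came back

    def simulate(policy, requests=1000, interval=50.0, seed=0):
        rng = random.Random(seed)
        fast = Conn("fast", latency=10.0, bias=0.0)
        slow = Conn("slow", latency=100.0, bias=1.0)  # "returned last"
        counts = {"fast": 0, "slow": 0}
        for i in range(requests):
            t = i * interval
            idle = [c for c in (fast, slow) if c.busy_until <= t]
            if not idle:
                continue                  # would queue; ignored here
            if policy == "response":      # MRU by last response received
                conn = max(idle, key=lambda c: c.last_response)
            elif policy == "send":        # MRU by last request sent
                conn = max(idle, key=lambda c: c.last_send)
            else:                         # unbiased random pick
                conn = rng.choice(idle)
            conn.last_send = t
            conn.busy_until = t + conn.latency
            conn.last_response = conn.busy_until  # latency is deterministic
            counts[conn.name] += 1
        return counts

    for policy in ("response", "send", "random"):
        print(policy, simulate(policy))

Running this, the "response" policy keeps handing the 10x-slower link half
of all requests indefinitely, the "send" policy routes essentially
everything to the fast link after one request, and the random pick lands
in between -- which is the bias-versus-no-bias trade-off discussed below.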

~~~
ngbronson
I coded the fix. We considered some options that had a bias away from the slow
links, but we thought it was safer to avoid bias completely. It was fine in my
tests, but I couldn't prove that biasing toward the faster link in a 2-link
setup wouldn't set up an oscillation.

~~~
cperciva
I wondered if that was it, but a 1/RTT loading behaviour seems safe enough to
me -- after all, that's exactly what TCP does, and you're presumably not
seeing any uncontrolled oscillations from that.

I guess it depends how much minimizing the size of your connection pool
matters.
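
For concreteness, the arithmetic behind that claim (my sketch, with
hypothetical RTTs): if every pooled connection is kept continuously busy,
each link completes one request per RTT, so link i carries a
(1/RTT_i) / sum_j (1/RTT_j) share of the load -- the same self-clocking
that keeps a fixed-window TCP flow sending at window/RTT.

    # Hypothetical per-link RTTs in ms; the load share each link gets
    # if every connection is kept continuously busy (one request per RTT).
    rtts = {"link_a": 10.0, "link_b": 50.0, "link_c": 100.0}
    total = sum(1.0 / r for r in rtts.values())
    for name, rtt in rtts.items():
        print(name, round((1.0 / rtt) / total, 3))
    # link_a 0.769, link_b 0.154, link_c 0.077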

------
mabbo
"The breakthrough came when we started thinking about the components in the
system as malicious actors colluding via covert channels."

That's probably the piece of that article that I'm going to remember for a
long time.

~~~
cperciva
That's exactly the thinking which led me to the first published cryptographic
side channel attack against hyperthreading: Intel's optimization manual had a
comment about being careful with stack alignment to avoid poor cache
performance, and I thought "what if _slow_ could instead be _maliciously
slow_?"

------
jxf
This was a great investigation of what must surely have been a complicated,
frustrating, and expensive problem. Awesome writeup.

I love reading post-mortems like this, even if the failures are unlikely
to happen at my startup, because the problem-solving techniques on display
tend to generalize to systems of almost any scale.

------
noselasd
Many years ago I did programming on a telephony application which
communicated over the old SS7 network. There, as a layer-4 protocol
entity, you can send a "link selection" key in the protocol messages. It's
just a 4-bit number (or 8 bits if it's ANSI SS7 instead of ITU, iirc) per
message, and the switches map this number to a physical downstream link,
often in a configurable fashion.

This is pretty neat, as the application can default to a round-robin
distribution and dynamically weight the link selection key based on
detected congestion/overload to shift the load to other links -- since
then I've often wished the TCP/IP world offered something similar.
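
A rough sketch of that pattern (hypothetical code, not from any real SS7
stack; the SLS field is 4 bits in ITU SS7, and the credit-based weighting
here is invented): round-robin over the 16 link-selection values by
default, and deweight a value when the link behind it looks congested.

    import itertools

    class SlsPicker:
        """Pick a 4-bit Signalling Link Selection (SLS) value per message."""

        def __init__(self, bits=4):
            self.values = list(range(1 << bits))       # 0..15 for ITU
            self.weights = {v: 1.0 for v in self.values}
            self._cycle = itertools.cycle(self.values)
            self._credit = {v: 0.0 for v in self.values}

        def set_weight(self, sls, weight):
            """Deweight an SLS whose downstream link is congested
            (0.0 = avoid entirely, 1.0 = normal share)."""
            self.weights[sls] = max(0.0, min(1.0, weight))

        def next_sls(self):
            # Weighted round-robin: a value accumulates "credit" at its
            # weight on every pass and is picked once it has a full unit.
            # Assumes at least one value has weight > 0.
            while True:
                v = next(self._cycle)
                self._credit[v] += self.weights[v]
                if self._credit[v] >= 1.0:
                    self._credit[v] -= 1.0
                    return v

    picker = SlsPicker()
    picker.set_weight(5, 0.25)   # link behind SLS 5 reported congestion
    print([picker.next_sls() for _ in range(20)])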

------
jessaustin
_This keeps all the packets of a TCP stream on the same link, avoiding out-of-
order delivery._

Not a network engineer, but I had thought that _TCP_ handles ordering itself?
Packets can travel completely different routes from origin to destination, so
one can't expect them to arrive in order. Destination's TCP stack can deal
with it.

Again, IANANE, but ISTM that those who first designed the network
introduced an untested complication with their LIFO setup. Nearly anything
would have worked, including not pooling at all; they just chose something
weird for the hell of it. Later, FB needed more performance out of the
system, and this harmful complication stayed hidden from sight.

~~~
The_Fox
TCP reorders segments just fine, but it treats reordering as a congestion
signal and slows down. That's why link aggregation frequently pins each
flow to a single link: it minimizes reordering so as to avoid triggering
TCP congestion control.

In Linux, you can change the sensitivity to reordering by writing to
/proc/sys/net/ipv4/tcp_reordering.
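
For reference, that knob is just a procfs file (Linux-only; the default of
3 means roughly three out-of-order segments, i.e. duplicate ACKs, are
tolerated before fast retransmit assumes loss):

    PATH = "/proc/sys/net/ipv4/tcp_reordering"

    # Read the current reordering tolerance (default: 3 segments).
    with open(PATH) as f:
        print("tcp_reordering =", f.read().strip())

    # Raising it tolerates more reordering before TCP assumes loss;
    # needs root (equivalent: sysctl -w net.ipv4.tcp_reordering=10).
    # with open(PATH, "w") as f:
    #     f.write("10")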

------
wazokazi
" At a meta-level, the next time you are debugging emergent behavior, you
might try thinking of the components as agents colluding via covert channels."

An insightful way of looking at a problem when the cause is not obvious.

------
fleitz
Oh man, this reminds me of troubleshooting a host behind a router that didn't
support window scaling.

It took a week or two to figure out why we couldn't transfer data but could
connect, ssh, ping, etc.

~~~
donavanm
Broken TCP window scaling? The connection should still work, with the
sender backing off, right? If you mean busted PMTUD, yeah, that's awesome:
the handshake and the GET work, but you can't get data back.

~~~
ak217
We saw busted PMTUD on a customer's network trying to get to us via a
misconfigured network link. It's a bad problem to debug. None of us knew what
PMTUD was, or that TCP can require ICMP to work, until that day.
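
A quick way to see that ICMP dependency (a hypothetical Linux-only probe;
the numeric constants come from <linux/in.h> in case the local Python
build doesn't export them): with DF forced on, a datagram larger than the
kernel's cached path MTU fails instantly with EMSGSIZE -- but the kernel
only learns that MTU from ICMP "fragmentation needed" messages, so when a
middlebox eats those, oversized packets silently vanish and the probe
keeps "succeeding".

    import errno
    import socket

    # Fallback values from <linux/in.h>.
    IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
    IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)

    def df_probe(host, size):
        """Send one UDP datagram of `size` bytes with DF set.

        True  -> the kernel let it out unfragmented.
        False -> EMSGSIZE: the kernel already knows the path MTU is smaller.
        """
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
        s.connect((host, 33434))          # arbitrary unused port
        try:
            s.send(b"\x00" * size)
            return True
        except OSError as e:
            if e.errno == errno.EMSGSIZE:
                return False
            raise
        finally:
            s.close()

    # df_probe("example.com", 1472)   # 1472 + 28 bytes of headers = 1500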

------
msandford
It took two years to figure this out? When I read the first couple of
paragraphs my first thought was "the low-latency links are getting dropped
from the eligibility pool somehow" and it turns out that was the problem.

I think my intuition there stems from having a lot of full-stack
responsibility (so the idea that it's someone else's problem was never a
luxury I could afford, since it was always my problem) and having cleaned up a
number of other people's very large spaghetti-code disasters.

The other possibility is that I have no special intuition into the problem
(this is far more likely) but did have a fresh set of eyes, while all the
people who were working on the problem were so intimately familiar with it
that they couldn't see the forest for the trees.

~~~
ngbronson
One of the reasons this took a long time to figure out is that it was a
failure amplifier, so there was always a more typical network problem
preceding it. Network failures in a data center cause lots of changes to the
packets, because of retries, failover, and automatic load balancing, so there
were a lot of trees to look at.

~~~
msandford
That makes a lot more sense. It would have been nice to include some of the
troubleshooting process so people can learn from that too. Thanks for sharing!

------
api
This is one of the best engineering write-ups I've ever read. It's also a
principle I'll remember, since I work a lot with distributed systems and p2p
networks: when debugging, imagine systems as adversaries and frame it as a
security problem.

Thanks!

------
donavanm
Color me confused. Just a week ago they posted "We also have server-side means
to “hash away” and route around trouble spots, if they occur."

[https://code.facebook.com/posts/360346274145943/introducing-...](https://code.facebook.com/posts/360346274145943/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/)

Is the server-side flow hash manipulation dependent on IP tuple
manipulation? Only enabled for some hosts/devices? A bit disappointed; I
was hoping for clever DSCP or MPLS tag manipulation. If they rely on new
streams with specific src ports it's a lot less interesting, and less
useful for long-lived connections.

~~~
mcpherrinm
This problem has been ongoing for two years: the post you linked described
their new data center design, which may not even be online yet.

------
spullara
This is one of the reasons I like in-depth technical interviews to include
debugging a problem with a couple of levels of indirection.

------
jimmyhmiller
FYI: The page layout is broken in the latest Firefox 34 beta on Mac. It
appears to be the image that is breaking it.

