

Facebook Fabric Networking Deconstructed - Swannie
http://firstclassfunc.com/facebook-fabric-networking

======
firstclassfunc
Thanks @virtuallynathan, @sargun, @tw04. Yes indeed, the fabric switches in
the MDF appear to be Arista 7304s. Once you know the numbers you're looking
for, you can work backwards to calculate the appropriate aggregation. We know
we need 192 (40GE) ports for the ToRs and 192 for the Spine. In the photo it
looks like 60 ports are wired per switch, but at a max of 128 40GE ports we
would need 3 fully loaded chassis per server Pod (about 285 for 95 Pods). The
real trick with Fat-Trees is that you do the port reservations up front and
build out the wiring plant knowing your upper limits, to avoid mass rewiring
tasks.
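The chassis arithmetic above can be sketched quickly (using only the numbers from this thread, so treat the inputs as assumptions, not Facebook's published design):

```python
import math

# 192 x 40GE ports facing the ToRs plus 192 facing the Spine, per pod.
ports_per_pod = 192 + 192           # 384 total 40GE ports per pod
max_ports_per_chassis = 128         # assumed fully loaded 7304-class chassis

chassis_per_pod = math.ceil(ports_per_pod / max_ports_per_chassis)
print(chassis_per_pod)              # 3 fully loaded chassis per pod

pods = 95
print(chassis_per_pod * pods)       # 285 chassis across 95 pods
```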

@sargun, as you point out, there are large scaling challenges with L2 networks
due to the flat addressing and the O(N^2) learning required (even though there
are plenty of hacks in place here). Policy controls for L2 networks are also
limited or difficult to scale, making it hard to build differentiated
services. We have known for a while that L2 networks do not scale very well,
but have been restricted by the need to support link-local communications for
legacy applications. Once you take this fundamental constraint off the table,
L3 networks offer a far superior solution, even though there are still
challenges with isolation and mobility.
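A back-of-the-envelope sketch of the scaling difference (illustrative numbers of my choosing, not anyone's real fabric): with flat L2, every edge switch may end up learning every host MAC, while L3 lets each switch hold an aggregated prefix per rack.

```python
hosts_per_rack = 48
racks = 1000
hosts = hosts_per_rack * racks           # 48,000 hosts

# Flat L2 worst case: each of the 1,000 ToRs learns every host MAC.
l2_entries_total = racks * hosts         # 48,000,000 entries fabric-wide

# L3 with one prefix per rack: each ToR needs roughly one route per rack.
l3_entries_total = racks * racks         # 1,000,000 entries fabric-wide

print(l2_entries_total // l3_entries_total)  # 48x less state, before further aggregation
```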

That being said, I don't believe Facebook builds applications that require
link-local non-routables, or that require an affinity between applications and
the faux host identity (i.e. the IP address). As you point out, there is a
tight coupling between the MAC address, IP address and point of attachment due
to a flawed model that has existed since IP split from TCP in version 4 of the
protocol (i.e. the one in common use today). All of the monkey-patching (TRILL,
SPB, LISP, VXLAN, NVGRE, etc.) has been to deal with this flawed model.

For those existing applications that do suffer from this affinity, the
solution du jour has been to use encapsulation protocols (i.e. network
virtualization) to solve mobility (Loc/ID split) while also improving
isolation by adding a new network namespace (e.g. ContextID, VNI, switch name,
etc.). This actually could have been avoided if we hadn't lost the
inter-networking layer of the stack (see John Day's work for an explanation).
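A toy sketch of the Loc/ID-split idea behind VXLAN-style encapsulation (the field layout is deliberately simplified and is not the real VXLAN header): the outer "locator" names the current point of attachment, the VNI adds a new namespace for isolation, and the inner frame carries the stable identity.

```python
import struct

def encapsulate(vni: int, inner_frame: bytes, outer_locator: bytes) -> bytes:
    # 8-byte toy overlay header: a flags word plus a 24-bit VNI,
    # shifted into place the way VXLAN positions its VNI field.
    header = struct.pack("!I", 0x08000000) + struct.pack("!I", vni << 8)
    return outer_locator + header + inner_frame

# Moving the workload changes only the outer locator (4-byte toy address),
# never the inner frame or its namespace.
frame = b"\x02\x00\x00\x00\x00\x01" + b"payload"
at_rack_a = encapsulate(5001, frame, b"\x0a\x00\x01\x01")
at_rack_b = encapsulate(5001, frame, b"\x0a\x00\x02\x01")
assert at_rack_a[4:] == at_rack_b[4:]    # VNI and identity unchanged
assert at_rack_a[:4] != at_rack_b[:4]    # only the locator moved
```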

Even if you build a Fat-Tree network as devised by Charles Leiserson, based on
the work of Charles Clos, you have to realize that because of the statistical
nature of communications, shared resources and pathologies related to
out-of-order packets, saturating the bisection is extremely difficult. Some
studies show that latency grows exponentially at just 40% offered load. In
building any network, the ability to maximize throughput depends first on the
topology, then on routing and flow control. Fat-Trees are designed to maximize
the bisection for the worst-case pairs permutation (i.e. each source
communicating with a destination across the min-cut), and as such can waste a
proportion of the network's capacity depending on the workload. Again, a
longer conversation :)..
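For reference, the textbook capacity numbers for a k-ary fat-tree (standard formulas, not Facebook's exact fabric): k-port switches support k^3/4 hosts at full bisection, but as noted above, provisioned bisection and achieved bisection are very different things.

```python
def fat_tree(k: int) -> dict:
    """Standard k-ary fat-tree sizing (k even): k pods of k/2 edge
    and k/2 aggregation switches, plus (k/2)^2 core switches."""
    return {
        "pods": k,
        "hosts": k ** 3 // 4,
        "core_switches": k ** 2 // 4,
        "bisection_links": k ** 3 // 8,  # links crossing the min-cut
    }

print(fat_tree(48))
# With 48-port switches: 48 pods, 27,648 hosts, 576 core switches,
# and 13,824 links across the bisection.
```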

------
virtuallynathan
Great work! Very informative.

This photo may change some of your assumptions:
[https://www.facebook.com/AltoonaDataCenter/photos/pb.4403012...](https://www.facebook.com/AltoonaDataCenter/photos/pb.440301219393803.-2207520000.1417172391./731322860291636/?type=3&theater)

In later photos, the ToRs appear to be Arista as well. It also appears they
tend to have only 24 servers/rack, so a non-blocking ToR could be a 48x10G
switch. Although I suppose the 4x40G uplinks could be broken out as 16x10G
here with the 48-port ToR.

EDIT: Scratch that, I forgot about the OpenCompute design; it looks like they
have 42-48 servers per rack in some cases.

~~~
tw04
Those are DEFINITELY Arista 7304s. So much for "none of the big vendors made
a switch that fit our needs so we built our own". It just seems odd they'd
claim they built all of their own switches, then post those pics.

~~~
virtuallynathan
Their own switch seems kinda goofy - it was huge and only 16x40G...

------
sargun
A couple of notes. Misspelling: "end-dhosts".

Probably why they went with layer 3: [http://sargun.me/a-critique-of-network-
design-1](http://sargun.me/a-critique-of-network-design-1)

I'm very interested in their Fastpass research, and how it'll allow for better
utilization of the network - actually being able to properly utilize full (or
even 25%) bisection bandwidth.

