
Facebook Operations Chief Reveals Open Networking Plan - SemanticWeber
http://www.enterprisetech.com/2013/10/04/facebook-operations-chief-reveals-open-networking-plan/
======
Someone
_"It starts up, it has a BIOS environment to do its diagnostics and testing,
and then it will look for an executable and go find an operating system."_

All switches on the internet booting from the network? That's the wet dream of
every hacker, including the NSA and their colleagues in other countries. The
cherry on the cake would be if these things could get a BIOS update without
human intervention. I hope they get the security right.

~~~
wmf
All of Facebook's servers already operate this way. The network is more
sensitive but probably not by much.

------
pm90
_> So we run BGP [Border Gateway Protocol] all the way to the top of the rack.
A lot of the big guys are doing that now – Microsoft, Google_

Can a more knowledgeable person please explain what he means by that? I was
under the impression that BGP was only for inter-ISP (or "Autonomous Network")
connections, and that it's slower than spanning tree.

~~~
fb_phoose
Hi pm90, I work on the Facebook network engineering team, so I can explain at a
high level some of the reasons we made this change.

You're correct that BGP was originally intended for inter-AS routing. However,
those AS's don't necessarily have to be different ISPs. They could be
different divisions in a company, or different racks in a data-center. They
can be any grouping where you might want different policies applied. You can
also use BGP between devices in the same AS where a common policy makes sense.
There's no technical limitation stopping you from using BGP inside the data-
center, so it comes down to pros and cons of the various options out there.

Our main goals were scalability, reliability and simplicity. When it comes to
scalability, BGP is essentially the best, which is why it's used between AS's,
where the routing tables can grow quite large. When it comes to reliability,
we previously had a layer2 architecture, but there are a lot of challenges
with that at scale. Some examples:

* Split brain: In a layer2 world, you still typically have a layer3 protocol running between your upper layers. We were using BGP and VRRP at the time, and issues in one wouldn't always be communicated to the other; in some cases that's not possible. So VRRP may think everything is fine while BGP is down, and the result is black-holed traffic.

* Scale: Running an all-layer2 network requires devices whose CAM tables are large enough to support all connected devices. Many vendors' newer full-line-rate cards were coming out with smaller CAM tables, so layer2 simply wasn't an option in some cases. Additionally, many of the other protocols that were options are computationally expensive to deploy at a large size: you either end up slowing them down or deploying relatively complex designs.
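
To put rough numbers on the scale point, here is a minimal sketch. The rack count, servers per rack, the `10.0.0.0/8` supernet and the one-`/26`-per-rack layout are all made-up assumptions for illustration, not Facebook's actual addressing plan. It compares how many entries an upper-layer device would need if it had to learn every host MAC (layer2) versus one route per rack (layer3 to the top of the rack).

```python
import ipaddress
from itertools import islice

RACKS = 1000          # hypothetical number of racks
HOSTS_PER_RACK = 40   # hypothetical servers per rack

# Layer 2: a switch in one large broadcast domain may have to learn
# a MAC entry for every connected host.
l2_entries = RACKS * HOSTS_PER_RACK

# Layer 3: assume each top-of-rack switch advertises a single prefix,
# so upstream devices carry one route per rack instead.
supernet = ipaddress.ip_network("10.0.0.0/8")
rack_prefixes = list(islice(supernet.subnets(new_prefix=26), RACKS))
l3_entries = len(rack_prefixes)

# A /26 leaves 62 usable addresses, enough for the assumed 40 servers.
assert rack_prefixes[0].num_addresses - 2 >= HOSTS_PER_RACK

print(f"layer2 MAC entries needed upstream: {l2_entries}")
print(f"layer3 routes needed upstream:      {l3_entries}")
print(f"first rack prefix: {rack_prefixes[0]}")
```

Under these assumptions the upstream table shrinks by a factor equal to the number of hosts per rack, which is the kind of headroom that matters when the hardware tables are small.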

One of the many significant benefits BGP brought us came in the form of ECMP.
In a layer2/VRRP world, balancing traffic is hacky even with two destinations,
and if you want to do more, which we did, it becomes increasingly complex.
With BGP it's built in: we can consistently and easily balance the load to 8,
16, 32 or more destinations depending on the hardware and software in use.
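
As a rough illustration of how ECMP spreads load, here is a minimal sketch of per-flow hashing. It assumes a 5-tuple flow key and uses SHA-256 purely for convenience; real switches use their own hardware hash functions, and the `spine-N` names and synthetic flows are hypothetical.

```python
import hashlib
from collections import Counter


def pick_next_hop(flow, next_hops):
    """Map a flow 5-tuple onto one of several equal-cost next hops."""
    # Hash the whole tuple so every packet of a flow takes the same path.
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return next_hops[int.from_bytes(digest[:8], "big") % len(next_hops)]


# Hypothetical next hops and synthetic flows, for illustration only.
# Flow tuple: (src IP, dst IP, IP protocol, src port, dst port).
next_hops = [f"spine-{i}" for i in range(8)]
flows = [("10.0.0.1", f"10.0.1.{i % 250}", 6, 40000 + i, 443)
         for i in range(8000)]

counts = Counter(pick_next_hop(flow, next_hops) for flow in flows)
for hop in next_hops:
    print(hop, counts[hop])
```

Going from 8 to 16 or 32 paths in this sketch is just a longer `next_hops` list, which is the sense in which the balancing is "built in" once the routing protocol installs multiple equal-cost routes.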

BGP is also readily available in the software from the major network vendors,
as well as in open source solutions for the servers. This means we can use
a common protocol everywhere, which in turn makes sourcing and deploying new
equipment much easier and simplifies the configs, policies and tooling needed
to support the deployment.

There are many other reasons that layer2 wasn't a good choice for us, and that
layer3 makes a lot of sense. I'd be happy to discuss more of these as well.

Hopefully this helps to answer your question.

~~~
pkj
>Running an all layer2 network requires devices that have large enough CAM
tables to support all connected devices, many vendors newer full line rate
cards were coming out with smaller CAM tables, as such layer2 simply wasn't an
option in some cases.

Does this imply that MAC forwarding tables (on switches) and ARP caches (on
hosts) need to have entries only for their immediate neighbours?

Curious to know how heavily modified the host network stack is. Also, how do
you provision a new server with the right IP? Is this mechanism in L2?

>There are many other reasons that layer2 wasn't a good choice for us, and
that layer3 makes a lot of sense. I'd be happy to discuss more of these as
well.

I am sure people would find that very useful. Thanks for the excellent
writeup!

~~~
jimB0
The last-hop switches facing the servers are layer 3 in the direction of the
aggregation switches and layer 2 in the direction of the servers, so ARP is
handled normally and DHCP can be handled with standard DHCP-helper mechanisms.
This allows the host networking stack to be vanilla (if you want it to be).

------
bitops
He's right - this is definitely where the network needs to go. When we can
manipulate network topology as easily as we can provision cloud-based servers
today, we'll be unimaginably more flexible than we are today.

~~~
donavanm
Virtual topology. Someone is still running cables at the end of the day. And
fat switch fabrics have some scary cabling indeed.

------
jdmitch
_We want to shoot for all of the things we want in open switches, but you
can’t boil the ocean. We want hardware disaggregated from software, and we
want to deploy that hardware in production._

I can't decide if this sounds a bit defeatist or is just pragmatic - isn't the
goal lofty enough to require the ambition to boil the ocean?

~~~
pkj
Sure, we are trying through global warming :)

On a serious note, take a look at Cumulus Networks [1], who claim to solve the
hw/sw disaggregation problem. They have a Linux OS distro which can run on h/w
from multiple vendors (not the popular ones like cisco/jnpr/hp/brcd etc., since
those are closed platforms).

[1]
[http://cumulusnetworks.com/product/overview/](http://cumulusnetworks.com/product/overview/)

~~~
devicenull
I so wish that pricing was available for any of their supported hardware :/

~~~
wmf
See
[http://www.colfaxdirect.com/store/pc/showsearchresults.asp?I...](http://www.colfaxdirect.com/store/pc/showsearchresults.asp?IDBrand=31&iPageSize=50)

------
roozbeh18
SNMP is dead

~~~
packetslave
Not so much, no

------
untilHellbanned
"Proceed and Be Bold"

And the anti-correlation between those who chestbeat stuff like this and those
who actually do it grows stronger...

