
Building a Billion User Load Balancer [video] - phodo
https://www.usenix.org/conference/lisa16/conference-program/presentation/shuff
======
dorianm
Map of Google data centers:
[http://imgur.com/l1dDdQe](http://imgur.com/l1dDdQe)

Map of Facebook data centers and PoP:
[http://imgur.com/dek8ESX](http://imgur.com/dek8ESX)

~~~
brobinson
Google and Facebook both have datacenters/PoPs in Taiwan? I can't imagine
the 24 million people there justify it, and both services are either largely
unused or blocked in the nearby PRC...

I'm more surprised that Google has no presence in Japan, though.

~~~
briandear
There's a major Asia-Pacific undersea cable connected to Taiwan, so Taiwan is
a useful Asia-Pacific gateway.

~~~
mafribe
More precisely, Taiwan is well connected by direct cables to Japan, China,
the Philippines, Vietnam, Malaysia, Thailand, Korea, Singapore, and the US. As
far as I'm aware, Google owns a cable connecting Taiwan with Japan (part of [1]).

[1]
[https://en.wikipedia.org/wiki/FASTER_(cable_system)](https://en.wikipedia.org/wiki/FASTER_\(cable_system\))

------
noponpop
Interesting that Linux kernel performance (IPVS) is acceptable at L4 versus
something like DPDK. I guess you just overcome the limitation by increasing
the number of L4 instances load-balanced via ECMP.

Fun to see DSR in use.

Also interesting to see that all the inherent problems with geolocation via
GSLB (the DNS resolver's IP is not the same as the real client's IP) apparently
don't wind up being a big problem. This seems to be a growing concern in my
experience: users aren't located where their ISP's DNS servers are.

~~~
79d697i6fdif
It's mostly because the point of DPDK and similar is to bypass a lot of the
processing in the kernel, and IPVS achieves much the same effect by
short-circuiting packets at L4 inside the kernel. I'm surprised IPVS isn't
more popular; it's built into the kernel and extremely fast.

HTTP-proxy-type load balancers are slugs in comparison.

Scaling app servers to nearly unlimited size is easy to explain but really
hard in practice. It basically amounts to this:

1) Balance requests using DNS anycast so you can spread load before it hits
your servers

2) Set up "head end" machines with the largest pipes you can get (40Gbps?) and
load balance at the lowest layer you can. Balance at the IP level using IPVS
and direct server return (a minimal ipvsadm sketch follows this list). A single
reasonable machine can handle a 40Gbps pipe. I guess you could set up a bunch
of these, but I doubt many people are over 40Gbps. Oh, and don't use cloud
services for these. The virtualization overhead is high on the network plane,
and even with SR-IOV you don't get access to all the hardware NIC queues. Also,
I don't know of any cloud provider that's compatible with direct server return,
since they typically virtualize your "private cloud" at layer 3, whereas IPVS
actually touches layer 2 a little. Do yourself a favor and get yourself a few
colos for your load balancers.

3) Set up a ton of HTTP-proxy-type load balancers. This includes Nginx,
Varnish, HAProxy, etc. One of these machines can probably handle 1-5 Gbps of
traffic, so expect 20 or so behind each layer 3 balancer. These NEED to be
hardened substantially, because most attacks will be layer 4 and up once an
adversary realizes they can't just flood you out (thanks to the powerful IPVS
balancers above). SYN cookies are extremely important here since you're
dealing with TCP... just try to set everything up to avoid storing TCP state
at all costs. This also means no NAT. You might want to keep these in the colo
with your L3 load balancers.

4) Now for your app servers. Depending on whether you're using a dog-slow
language or not, you'll want between 3 and 300 app servers behind each HTTP
proxy. You don't really need to harden these as much, since the traffic is
lower and anything that reaches them is clean HTTP. Go ahead and throw these
on the cloud if you want.
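
For the IPVS/DSR piece in step 2, here's a minimal sketch of what the director
and real-server setup could look like; the VIP (203.0.113.10), real-server
addresses, and scheduler choice are placeholders, not anything from the talk:

    # On the director: define the virtual service and add real servers
    # in gatewaying mode (-g), i.e. direct server return.
    ipvsadm -A -t 203.0.113.10:80 -s wlc
    ipvsadm -a -t 203.0.113.10:80 -r 10.0.0.11:80 -g
    ipvsadm -a -t 203.0.113.10:80 -r 10.0.0.12:80 -g

    # On each real server: hold the VIP on loopback so replies go straight
    # to the client, and suppress ARP for it so the director keeps receiving
    # all inbound traffic destined to the VIP.
    ip addr add 203.0.113.10/32 dev lo
    sysctl -w net.ipv4.conf.all.arp_ignore=1
    sysctl -w net.ipv4.conf.all.arp_announce=2

The key property is that only the inbound leg crosses the director; responses
bypass it entirely, which is why a single box can front such a large pipe.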

~~~
woodcut
What if you're not dealing with millions of connections but instead only a few
thousand from whitelisted IPs, and you need to optimise for high availability
and latency? Could it be done with just anycast -> IPVS layer -> app servers?

~~~
bogomipz
If it's stateless traffic, then yes.

The ECMP/anycast just gets you beyond the limit of a single pair of IPVS
boxes, which are kept in sync with keepalived/VRRP for HA.

But a pair of boxes with IPVS + keepalived + iptables should be able to handle
a few thousand connections, no problem. Your concern would then likely be the
bandwidth going through the box. But if your clients pull rather than push,
using direct server return should get you past the bandwidth limitations of a
single box.
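
For reference, a minimal keepalived sketch of such a pair might look like the
following; the addresses, interface name, and health check are made-up
placeholders. VRRP floats the VIP between the two boxes, and the
virtual_server block programs IPVS in DR mode:

    vrrp_instance VI_1 {
        state MASTER              # BACKUP on the second box
        interface eth0
        virtual_router_id 51
        priority 100              # lower priority on the second box
        virtual_ipaddress {
            203.0.113.10
        }
    }

    virtual_server 203.0.113.10 80 {
        lb_algo wlc               # weighted least-connection
        lb_kind DR                # direct server return
        protocol TCP
        real_server 10.0.0.11 80 {
            TCP_CHECK {
                connect_timeout 3
            }
        }
        real_server 10.0.0.12 80 {
            TCP_CHECK {
                connect_timeout 3
            }
        }
    }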

------
fforflo
Slides here

[http://www.slideshare.net/patrickshuff/buildinga-billionuser...](http://www.slideshare.net/patrickshuff/buildinga-billionuserloadbalancer-may2015srecon15europeshuff)

------
luizbafilho
For those interested in replicating this same architecture in their own
environment: I am working on a similar solution. Please check it out:

[https://github.com/luizbafilho/fusis](https://github.com/luizbafilho/fusis)

It is a control plane for IPVS that adds distribution, fault tolerance,
self-configuration, and a nice JSON API.

It is almost done, but it still needs documentation on how to use it.

Moreover, for those who like numbers, I did some benchmarks using a 16-core
machine with two bonded 10Gbit interfaces.

Scenarios:

1 request per connection with 1 byte: 115k connections/s

20 requests per connection with 1 byte: 670k requests/s

20 requests per connection with 1 megabyte: 14Gbps

Each scenario tests one specific aspect of the load balancer.

------
uniclaude
In case anyone else is looking for a transcript: I can't check right now
whether this talk is much different from the one he gave in 2015, but just in
case, here's a link to the YouTube video of the older one:

[https://www.youtube.com/watch?v=MKgJeqF1DHw](https://www.youtube.com/watch?v=MKgJeqF1DHw)

That's the only way I found to get any sort of text.

------
visarga
Good thing this is out. I'm gonna put one on my blog, in anticipation of the
traffic I'd like to have!

------
ju-st
The "cartographer" doesn't use end user related information to select the best
PoP? Or does "Sonar" measure the latency/throughput by looking at existing
connections to users?

OK, I watched the presentation now; Sonar apparently does exactly that.

~~~
patrickshuff
Sonar is the system we use to measure latency from the client devices to all
of our PoPs.

Cartographer is the system that consumes all of these Sonar measurements,
combines them with several other real-time data sources (BGP routes, link
capacity, PoP health, PoP capacity, etc.), and continually generates a GLB map
for the most optimal targeting of requests to our PoPs.

~~~
bogomipz
I am curious: does Cartographer also adjust iBGP preferences for balancing
among transit providers for egress traffic?

------
tonyplee
This also gives an interesting overview of SDN, NFV, and scale inside Google's
data centers.

[https://www.youtube.com/watch?v=vMgZ_BdipYw](https://www.youtube.com/watch?v=vMgZ_BdipYw)

------
je42
Wondering how tinydns and Cartographer work together? I didn't see any
dynamic-response capability in tinydns. Did I miss it in the tinydns docs?

~~~
nocarrier
Cartographer ingests BGP topology, PoP and datacenter health, and other
things, and then pushes what is basically a lookup table to tinydns. None of
the dynamism is inside the DNS server itself, which lets it focus on what it
is good at: responding to a crap ton of DNS requests per second.
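
For what it's worth, stock tinydns can already serve different answers based
on the client's (i.e. the resolver's) address via location lines in its data
file, so a pushed lookup table could plausibly look like the sketch below. The
prefixes, hostnames, and location codes are made up for illustration; the talk
doesn't say whether Facebook uses this built-in mechanism or a patched lookup:

    # %lo:ipprefix -- tag resolvers in this prefix with location code "lo"
    %eu:203.0.113
    %us:198.51.100

    # +fqdn:ip:ttl:timestamp:lo -- record only answered to clients in that location
    +edge.example.com:192.0.2.10:30::eu
    +edge.example.com:192.0.2.20:30::us

Note that this keys on the resolver's address, which is exactly the GSLB
caveat mentioned upthread.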

------
acqq
Is this audio/video only, or am I missing something? If it is, maybe it should
be marked in the title.

~~~
RyJones
There is an embedded video on the page.

~~~
acqq
Then maybe the title should have [video] at the end?

I was unable to find anything to read.

