
GLB: GitHub's open source load balancer - csteinbe
https://githubengineering.com/glb-director-open-source-load-balancer/
======
justinsaccount
See also:

Facebook: [https://code.fb.com/open-source/open-sourcing-katran-a-scala...](https://code.fb.com/open-source/open-sourcing-katran-a-scalable-network-load-balancer/)

Google: [https://cloudplatform.googleblog.com/2016/03/Google-shares-s...](https://cloudplatform.googleblog.com/2016/03/Google-shares-software-network-load-balancer-design-powering-GCP-networking.html)

The design of all 3 is very similar.

~~~
lmb
I haven't looked at what Google has released, but there are big differences
between GLB and Katran. (Not affiliated with any of those companies)

In terms of technology, Katran uses XDP and IPIP tunnels, both upstream in the
Linux kernel. GLB uses DPDK, which allows processing raw packets in user
space, and Generic UDP Encapsulation + a custom iptables module. Neither DPDK
nor the module are upstream.

There are architectural differences as well. Katran is much closer to a
classic load balancer, and uses connection tracking at the load balancer to
know where to send packets for established flows. GLB has no per-flow state at
the load balancer, which gives it the very nice property that load balancers
can be added and removed from an ECMP group without disturbing existing
connections. There is an academic paper about a system called Beamer, which
most likely influenced GLB (or maybe the other way round?). It's a good read,
and relatively short.
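
To make the stateless idea concrete, here's a minimal sketch (not GLB's actual
code, hash functions, or table layout; host names and sizes are made up) of how
every director can pick the same (primary, secondary) pair for a packet using
only the packet's fields and a table each director derives identically from the
proxy list:

    import hashlib

    def hrw_rank(row, hosts):
        # Rendezvous (highest-random-weight) hashing: score each host against
        # this table row; every director computes the same ordering.
        score = lambda h: hashlib.md5(f"{row}|{h}".encode()).digest()
        return sorted(hosts, key=score, reverse=True)

    def build_table(hosts, size=256):
        # Every director builds the identical table from the same host list;
        # each row keeps a primary and a secondary ("second chance") proxy.
        return [hrw_rank(row, hosts)[:2] for row in range(size)]

    def pick(table, five_tuple):
        # The per-packet decision depends only on the packet and the shared
        # table, so directors can join or leave the ECMP group freely.
        h = hashlib.md5("|".join(map(str, five_tuple)).encode()).digest()
        return table[int.from_bytes(h[:4], "big") % len(table)]

    table = build_table(["proxy1", "proxy2", "proxy3"])
    print(pick(table, ("198.51.100.7", 40123, "203.0.113.10", 443, "tcp")))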

Finally, Katran is really a C++ library you could build a load balancer on,
while GLB comes with batteries included.

I think GLB looks nice, hats off to GitHub for open sourcing it.

~~~
notyourday
It is unfortunate that Fastly did not open source faild. That's the most
elegant solution to draining that I have ever seen.

~~~
theojulienne
Fastly's MAC-based solution to this was actually one of the existing
implementations we read about back when designing the original implementation
of GLB in 2015/16, along with Facebook's IPVS-based solution. We loved the
ideas behind Fastly's model, but didn't want to mess with Layer 2 to do it.
GLB Director took some inspiration from both designs in the creation of L4
second chance and the L4/L7 split design.

~~~
gbrayut
Thanks for the details! I'd love to hear more about any data you have on the
efficiency of a "second chance" design vs expanding to three or more failover
servers. Very curious whether a single alternate is enough to cover the
majority of incidents (xxx out of 1000 events?), or how frequently you see
failures that fall outside the two-chance design decision.

Keep up the great work!

------
emmericp
Cool use of SR-IOV, I like it. We've done a few (academic) experiments with
SR-IOV for flow bifurcation and we've wondered why no one seems to use it like
this. The performance was quite good: negligible difference between the PF and
a single VF, and only 5-10% when running multiple (>= 8) VFs (probably cache
contention somewhere in our specific setup).

You seem to be running this on X540 NICs; aren't you running into limitations
with the VFs? Mostly the number of queues, which I believe is limited to 2 per
VF in the ixgbe family. I wonder whether the AF_XDP DPDK driver could be used
instead if SR-IOV isn't available or feasible for some reason.

A more detailed look at performance would have been cool. I might try it
myself if I find some time (or a student) :)

~~~
theojulienne
We found that we could achieve 10G line rate with just the queues available to
the VF; the NIC didn't seem to be a bottleneck provided DPDK was processing
packets faster than line rate. It's worth noting that other traffic on the PF
was/is minimal in our setup.

We tested this using DPDK pktgen on an identically configured node (GLB
Director and pktgen both using DPDK on a VF with flow bifurcation, on 2
separate machines on the same rack/switch), with GLB Director essentially
acting as a reflector back to the pktgen node. pktgen was able to generate
enough 40 byte TCP packets to saturate 10G with 2 TX cores/queues, and GLB
Director was able to process those packets and encapsulate them with a
sizeable set of binds/tables with 3 cores doing work (encapsulation) and 1
core doing RX/distribution/TX.

~~~
emmericp
Yeah, 10G just isn't that much nowadays. And bigger NICs have more features in
the VFs.

I've just built a quick test setup:

* two directly connected servers

* 6 core 2.4 GHz CPU

* XL710 40G NICs

* My packet generator MoonGen: [https://github.com/emmericp/MoonGen](https://github.com/emmericp/MoonGen) with a quick & dirty modification to l3-tcp-syn-flood.lua to change dst mac

Got these results for 1-5 worker threads in Mpps: 3.84, 6.65, 10.17, 11.57,
11.3.

~10 Mpps is about 10G line rate for the encapsulated packets; this seems a
little bit slower than I expected and it looks like I might be hitting the
bottleneck of the distributor at 4 worker threads. Didn't look into anything
in detail here (spent maybe 30 minutes for setup + tests), but we've done some
VXLAN stuff in the past which I recall being faster.

------
q3k
This basically looks like an open source Maglev [1]. Awesome!

[1] -
[https://static.googleusercontent.com/media/research.google.c...](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44824.pdf)

~~~
vbernat
This is similar, but not identical. Linux gets an open source version of Maglev
as a scheduler for LVS in 4.18 (to be released). GLB uses a specific consistent
hashing algorithm that selects two servers, plus a module to let the first
server redirect the flow to the second in case it doesn't know about it. This
helps minimize disruption even more than with Maglev.
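
For illustration only, here is a rough sketch of that "second chance" decision
on the proxy side (the real implementation is GLB's custom iptables module
consulting the kernel's own TCP state; the field names below are made up):

    def second_chance(pkt, local_flows, alternate):
        # pkt: parsed inner TCP packet; alternate: the other proxy chosen by
        # the director's hash table, carried in the GLB encapsulation header.
        flow = (pkt["src"], pkt["sport"], pkt["dst"], pkt["dport"])
        if pkt["syn"] and not pkt["ack"]:
            return "ACCEPT"            # brand-new connection: handle it here
        if flow in local_flows:
            return "ACCEPT"            # established locally: keep handling it
        return ("FORWARD", alternate)  # unknown here, so bounce it to the
                                       # second-chance host, which likely
                                       # still holds the connection

    # Example: a mid-connection ACK for a flow this proxy doesn't know about
    pkt = {"src": "198.51.100.7", "sport": 40123, "dst": "203.0.113.10",
           "dport": 443, "syn": False, "ack": True}
    print(second_chance(pkt, local_flows=set(), alternate="proxy2"))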

------
jsiepkes
Looks really cool! Though a simpler solution for most people will probably be
OpenBSD's CARP protocol to share a single virtual IP between multiple boxes
(with for example relayd). ECMP routing can get complex fast.

~~~
SEJeff
Or VRRP with the open source keepalived, which has been around for a decade+
and works wonderfully on Linux.

~~~
gbrayut
That is exactly what Stack Overflow uses: keepalived to manage a virtual IP
between two decent-sized bare-metal HAProxy servers (w/ bonded 10G NICs). Works
great and, combined with DNS- or Anycast-based load balancing, can scale pretty
damn well. Definitely worth investigating as a KISS approach. To quote a recent
Atwood tweet:

"if I have learned anything in my career, it is the shocking effectiveness of
building ... literally the stupidest thing that could work. (And then
iterating on it for a decade.)"

[https://twitter.com/codinghorror/status/1026332543153389569](https://twitter.com/codinghorror/status/1026332543153389569)
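
For anyone curious what that looks like in practice, a minimal keepalived
sketch of the VIP-between-two-HAProxy-hosts setup (interface name, VRID,
priority, and the VIP are placeholders; the backup host gets the same block
with state BACKUP and a lower priority):

    vrrp_instance HAPROXY_VIP {
        state MASTER
        interface bond0
        virtual_router_id 51
        priority 150
        advert_int 1
        virtual_ipaddress {
            203.0.113.10/24
        }
    }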

~~~
SEJeff
Indeed! It powers all of the internal load balancing (not directly
customer-facing) for ticketmaster.com. I was on the core systems team ~12 or so
years ago and learned all about how great it is.

------
Drdrdrq
Off-topic: I _love_ the GLB icon, pure genius!

------
bogomipz
I am having trouble understanding this passage. I'm wondering if someone could
help me understand it, as it seems like an important design detail:

>"Another benefit to using UDP is that the source port can be filled in with a
per-connection hash so that they flow within the datacenter over different
paths (where ECMP is used within the datacenter), and received on different RX
queues on the proxy server’s NIC (which similarly use a hash of TCP/IP header
fields)."

A source port in the UDP header still needs to be just that, a port number,
no? Or are they actually stuffing a hash value into that UDP header field? How
would the receiving IP stack know how to understand a value other than a port
number in that field?

~~~
gbrayut
Just a guess, but by using UDP transport for the encapsulated data and
configuring the module on the proxy to accept UDP on a wide range of ports,
you can pick any port you want (not just the destination port of the TCP
stream in the encapsulated packet). And if you are using ECMP with a known
hashing algorithm you can then use that UDP port to explicitly spread packets
across the RX queues on the proxy servers (gaining better performance).
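
A hedged sketch of that idea (illustrative only; GLB's real hash and the fixed
UDP port its proxies listen on are different, and 19523 below is just a
placeholder): the director hashes the inner TCP flow and writes the result into
the outer UDP source port, so ECMP in the fabric and RSS on the proxy NIC both
spread flows while keeping each individual flow on one path/queue.

    import hashlib

    GUE_DST_PORT = 19523  # placeholder for the fixed port the proxies accept

    def outer_udp_ports(inner_flow):
        # inner_flow: (src_ip, src_port, dst_ip, dst_port) of the encapsulated
        # TCP connection. Routers doing ECMP and the NIC doing RSS only hash
        # the *outer* headers, so a per-flow outer source port spreads flows
        # across paths and RX queues without reordering any single flow.
        digest = hashlib.md5("|".join(map(str, inner_flow)).encode()).digest()
        sport = 1024 + int.from_bytes(digest[:2], "big") % (65536 - 1024)
        return sport, GUE_DST_PORT

    print(outer_udp_ports(("198.51.100.7", 40123, "203.0.113.10", 443)))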

~~~
bogomipz
Thanks, I think that might be what they mean: fill the source port with a hash
for better distribution across the RX queues on the destination proxy. Cheers.

------
subleq
This is the first I've heard of Rendezvous hashing. It seems superior in every
respect to the ring-based consistent hashing I've heard much more about. Why
is the ring-based method more common?

------
bogomipz
The article states:

>"Each server has a bonded pair of network interfaces, and those interfaces
are shared between DPDK and Linux on GLB director servers."

What's the distinction between DPDK and Linux here? It wasn't clear to me why
SR-IOV is needed in this design. Does DPDK need to "own" the entire NIC device?
In other words, are DPDK and regular kernel networking mutually exclusive
options on the NIC? Is that correct?

------
llama052
Maybe I don't see the use case since I'm not at that scale, but it seems like
a lot of added complexity for what appears to be a hack around using other
load balancing solutions as a Layer 4 option?

Edit: It's a question; if you downvote, please let me know why it's a better
solution.

~~~
vbernat
The goal is to avoid disruption during topology changes. If you have
long-lived connections, this is important for keeping them alive. They explain
this a bit more here:
[https://github.com/github/glb-director/blob/master/docs/deve...](https://github.com/github/glb-director/blob/master/docs/development/second-chance-design.md#comparison-to-lvs-and-other-director-state-solutions)

I have also written about this in a past article:
[https://vincent.bernat.im/en/blog/2018-multi-tier-loadbalanc...](https://vincent.bernat.im/en/blog/2018-multi-tier-loadbalancer)
(which may or may not be easier to understand)

~~~
ngrilly
I read your link, and it's easy to understand because it's really well
explained :-)

------
KenanSulayman
Why would one use this over HAProxy?

~~~
atombender
HAProxy is a Layer 7 (i.e. HTTP, for the most part) load balancer and only
handles the use case of spreading load across multiple backends. A single
instance binds to a single IP. There's no redundancy; lose the HAProxy and you
lose traffic.

For true redundancy, you need a layer above that handles the distribution of
traffic to multiple redundant load balancer instances, and GLB does that via
ECMP (Equal-Cost Multi-Path) routing. GitHub supposedly uses HAProxy as their
L7 load balancer.

All of this is thoroughly explained in the article.

~~~
llama052
I thought best practice for HAProxy was to run two HAProxy instances in
parallel with VRRP or DNS load balancing?

Does that not achieve the same outcome? I've used HAProxy for layer 4 in the
past without any issue this way.

~~~
gbrayut
It does, but that setup runs into the throughput limits of individual servers,
and doesn't have the same drain/fill/failover capabilities discussed in the
article.

To be clear, HAProxy+VRRP+DNS is often a better solution, but this describes an
interesting design for a load balancing system that can handle many orders of
magnitude more traffic and undergo maintenance without breaking established
connections (one of its core design features).

------
toomuchtodo
Previous discussion (September 2016):
[https://news.ycombinator.com/item?id=12558053](https://news.ycombinator.com/item?id=12558053)

~~~
Chris911
This new post is about the newly released GLB Director. The title should be
changed.

------
koolhead17
Will it have some special love in the Azure ecosystem?

~~~
tootie
Azure already offers load balancing. I'm not sure how much differentiates all
the products out there now. I've never seen a load balancer be a bottleneck in
any system I've worked on.

------
weberc2
Where is the source code? Skimmed but didn’t see a link.

~~~
geospeck
[https://github.com/github/glb-director](https://github.com/github/glb-director)

