The design of all 3 is very similar.
In terms of technology, Katran uses XDP and IPIP tunnels, both upstream in the Linux kernel. GLB uses DPDK, which allows processing raw packets in user space, plus Generic UDP Encapsulation and a custom iptables module. Neither DPDK nor the module is upstream.
There are architectural differences as well. Katran is much closer to a classic load balancer, and uses connection tracking at the load balancer to know where to send packets for established flows. GLB has no per-flow state at the load balancer, which gives it the very nice property that load balancers can be added and removed from an ECMP group without disturbing existing connections. There is an academic paper about a system called Beamer, which most likely influenced GLB (or maybe the other way round?). It's a good read, and relatively short.
Finally, Katran is really a C++ library you could build a load balancer on, while GLB comes with batteries included.
I think GLB looks nice, hats off to GitHub for open sourcing it.
Kind of; from the Katran post:
> Each L4LB also stores the backend choice for each 5-tuple as a lookup table to avoid duplicate computation of the hash on future packets. This state is a pure optimization and is not necessary for correctness
> Katran uses an extended version of the Maglev hash to select the backend server. A few features of the extended hash are resilience to backend server failures, more uniform distribution of load, and the ability to set unequal weights for different backend servers.
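To make the two pieces concrete, here is a toy sketch of that flow: a small Maglev-style lookup table plus a per-5-tuple cache in front of it. This is illustrative only, not Katran's actual code; the SHA-256-based hashes, the tiny prime table size, and all names are my own assumptions.

```python
import hashlib

M = 13  # lookup-table size; must be prime (real Maglev uses e.g. 65537)

def _h(s, salt):
    # Stand-in hash function (assumption; not what Katran uses).
    return int(hashlib.sha256((salt + s).encode()).hexdigest(), 16)

def maglev_table(backends, m=M):
    # Each backend gets a pseudo-random permutation of the table slots.
    perms = []
    for b in backends:
        offset = _h(b, "offset") % m
        skip = _h(b, "skip") % (m - 1) + 1  # coprime with prime m
        perms.append([(offset + j * skip) % m for j in range(m)])
    # Backends take turns claiming their next preferred free slot.
    table, nxt, filled = [None] * m, [0] * len(backends), 0
    while filled < m:
        for i, _ in enumerate(backends):
            while table[perms[i][nxt[i]]] is not None:
                nxt[i] += 1
            table[perms[i][nxt[i]]] = backends[i]
            nxt[i] += 1
            filled += 1
            if filled == m:
                break
    return table

_flow_cache = {}

def pick_backend(table, five_tuple):
    # Per-5-tuple cache, as the post describes: a pure optimization,
    # since correctness comes from the table lookup itself.
    if five_tuple not in _flow_cache:
        _flow_cache[five_tuple] = table[hash(five_tuple) % len(table)]
    return _flow_cache[five_tuple]
```

The point of the Maglev construction is that when a backend is removed, only its slots are reassigned, so most flows keep hashing to the same server even without the cache.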
GLB also does the same thing with caching:
> The hashing to choose the primary/secondary server is done once, up front, and is stored in a lookup table, and so doesn’t need to be recalculated on a per-flow or per-packet basis.
GLB has that primary/secondary thing which seems to be how it better handles backends coming and going.
GLB is more complex, but will maintain connections in more circumstances.
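A rough sketch of that primary/secondary idea, with each table row holding the top two backends under a per-row ranking (rendezvous style). All names, the row count, and the SHA-256 ranking are my assumptions; GLB's real table construction differs.

```python
import hashlib

def glb_style_table(backends, rows=256):
    # For every row, rank backends by a per-row hash and keep the
    # top two as (primary, secondary).
    table = []
    for r in range(rows):
        ranked = sorted(
            backends,
            key=lambda b: hashlib.sha256(f"{b}|{r}".encode()).digest(),
            reverse=True,
        )
        table.append((ranked[0], ranked[1]))
    return table

def route(table, flow_hash, owns_connection):
    # The director always forwards to the primary; the primary bounces
    # the packet to the secondary when it doesn't own the connection
    # (e.g. it joined the pool after the flow was established).
    primary, secondary = table[flow_hash % len(table)]
    return primary if owns_connection(primary) else secondary
```

This second-chance hop is what lets GLB drain or add backends without per-flow state on the director: established connections that land on the "wrong" primary still reach the server that owns them.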
Keep up the great work!
You seem to be running this on X540 NICs; aren't you running into limitations with the VFs? Mostly the number of queues, which I believe is limited to 2 per VF in the ixgbe family.
I wonder whether the AF_XDP DPDK driver could be used instead if SR-IOV isn't available or feasible for some reason.
A more detailed look at performance would have been cool. I might try it myself if I find some time (or a student) :)
We tested this using DPDK pktgen on an identically-configured node (GLB Director and pktgen both using DPDK on a VF with flow bifurcation, on 2 separate machines on the same rack/switch), with GLB Director essentially acting as a reflector back to the pktgen node. pktgen was able to generate enough 40-byte TCP packets to saturate 10G with 2 TX cores/queues, and GLB Director was able to process those packets and encapsulate them with a sizeable set of binds/tables with 3 cores doing work (encapsulation) and 1 core doing RX/distribution/TX.
I've just built a quick test setup:
* two directly connected servers
* 6 core 2.4 GHz CPU
* XL710 40G NICs
* My packet generator MoonGen: https://github.com/emmericp/MoonGen with a quick & dirty modification to l3-tcp-syn-flood.lua to change the dst MAC
Got these results for 1-5 worker threads in Mpps: 3.84, 6.65, 10.17, 11.57, 11.3.
~10 Mpps is about 10G line rate for the encapsulated packets; this seems a little bit slower than I expected and it looks like I might be hitting the bottleneck of the distributor at 4 worker threads. Didn't look into anything in detail here (spent maybe 30 minutes for setup + tests), but we've done some VXLAN stuff in the past which I recall being faster.
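For reference, the "~10 Mpps is about 10G line rate" claim checks out with simple wire-overhead arithmetic, assuming minimum 64-byte frames plus roughly 40 bytes of encapsulation overhead (my estimate, not a measured figure):

```python
def line_rate_mpps(frame_bytes, link_gbps=10.0):
    # frame_bytes includes the 4-byte FCS; every frame also costs
    # 8 bytes of preamble + 12 bytes of inter-frame gap on the wire.
    wire_bits = (frame_bytes + 20) * 8
    return link_gbps * 1e9 / wire_bits / 1e6

print(round(line_rate_mpps(64), 2))        # minimum-size frames: 14.88
print(round(line_rate_mpps(64 + 40), 2))   # with ~40 B of encap: 10.08
```

So an encapsulating director saturating 10G on the output side tops out right around 10 Mpps of minimum-size input traffic.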
 - https://static.googleusercontent.com/media/research.google.c...
"if I have learned anything in my career, it is the shocking effectiveness of building ... literally the stupidest thing that could work. (And then iterating on it for a decade.)"
>"Another benefit to using UDP is that the source port can be filled in with a per-connection hash so that they flow within the datacenter over different paths (where ECMP is used within the datacenter), and received on different RX queues on the proxy server’s NIC (which similarly use a hash of TCP/IP header fields)."
A source port in the UDP header still needs to be just that, a port number, no? Or are they actually stuffing a hash value into that UDP header field? How would the receiving IP stack know how to interpret a value other than a port number in that field?
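Any 16-bit value is a valid UDP source port; for encapsulated traffic the receiver's decap stack never treats it as a service port, so intermediate switches and NICs just feed it into their ECMP/RSS hashes. A sketch of the idea (the function name and the CRC32 hash are illustrative assumptions, not GLB's actual hash):

```python
import struct
import zlib

def outer_source_port(src_ip, src_port, dst_ip, dst_port, proto=6):
    # Hash the inner 5-tuple down to 16 bits and use it as the outer
    # UDP source port, so different inner flows take different ECMP
    # paths and land on different RX queues.
    key = struct.pack(
        "!4s4sHHB",
        bytes(int(o) for o in src_ip.split(".")),
        bytes(int(o) for o in dst_ip.split(".")),
        src_port, dst_port, proto,
    )
    # Force the top bit so the value stays in the ephemeral range.
    return 0x8000 | (zlib.crc32(key) & 0x7FFF)
```

Since the same inner 5-tuple always yields the same outer source port, all packets of one connection still take one path and hit one queue.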
>"Each server has a bonded pair of network interfaces, and those interfaces are shared between DPDK and Linux on GLB director servers."
What's the distinction between DPDK and Linux here? It wasn't clear to me why SR-IOV is needed in this design. Does DPDK need to "own" the entire NIC device; is that it? In other words, are DPDK and regular kernel networking mutually exclusive options on the NIC? Is that correct?
Edit: It's a question; if you downvote, please let me know why.
I have also written about this in a past article: https://vincent.bernat.im/en/blog/2018-multi-tier-loadbalanc... (which may or may not be easier to understand)
For true redundancy, you need a layer above that handles the distribution of traffic to multiple redundant load balancer instances, and GLB does that via ECMP (Equal-Cost Multi-Path) routing. GitHub supposedly uses HAProxy as their L7 load balancer.
All of this is thoroughly explained in the article.
Does that not achieve the same outcome? I've used HAProxy for layer 4 in the past without any issue this way.
To be clear, HAProxy+VRRP+DNS is often the better solution, but this describes an interesting design for a load balancing system that can handle many orders of magnitude more traffic and undergo maintenance without breaking established connections (one of its core design features).
One example of using HAProxy as an L4 LB instead of letting it do the termination is when it is proxying TLS traffic from and to multiple backends. Or WebSocket. Or even as a bastion LB for SSH, should one bastion go down.
This stuff isn't exactly new; it's essentially the Maglev system described by Google in a 2016 paper. Other companies are now catching up to Google (which is of course 2+ years ahead).
BGP (really ECMP) doesn't handle failures gracefully; that's the benefit of GLB.
Then you could address the LB with an address like some-service.lb.intranet and just use that wherever you would use the original service.
Additionally, DNS will leave your load balancing at the mercy of ISPs' DNS server settings. At least in the past, it wasn't exactly unheard of for an ISP to cache only a single A record, so all of its clients would be directed to a single server.
That said, DNS-based load balancing is generally a good-enough solution for most people.
It's right there in the 2nd paragraph.
>GLB Director does not replace services like haproxy and nginx, but rather is a layer in front of these services (or any TCP service) that allows them to scale across multiple physical machines without requiring each machine to have unique IP addresses.