Hacker News new | comments | ask | show | jobs | submit login
GLB: GitHub's open source load balancer (githubengineering.com)
478 points by csteinbe 6 months ago | hide | past | web | favorite | 62 comments

I haven't looked at what Google has released, but there are big differences between GLB and Katran. (Not affiliated with any of those companies)

In terms of technology, Katran uses XDP and IPIP tunnels, both upstream in the Linux kernel. GLB uses DPDK, which allows processing raw packets in user space, and Geberic UDP encapsulation + a custom iptables module. Neither DPDK nor the module are upstream.

There are architectural differences as well. Katran is much closer to a classic load balancer, and uses connection tracking at the load balancer to know where to send packets for established flows. GLB has no per-flow state at the load balancer, which gives it the very nice property that load balancers can be added and removed from an ECMP group without disturbing existing connections. There is an academic paper about a system called Beamer, which most likely influenced GLB (or maybe the other way round?). It's a good read, and relatively short.

Finally, Katran is really a C++ library you could build a load balancer on, while GLB comes with batteries included.

I think GLB looks nice, hats off to GitHub for open sourcing it.

> uses connection tracking at the load balancer to know where to send packets for established flows

Kind of, from katran post:

> Each L4LB also stores the backend choice for each 5-tuple as a lookup table to avoid duplicate computation of the hash on future packets. This state is a pure optimization and is not necessary for correctness

> ..

> Katran uses an extended version of the Maglev hash to select the backend server

If the connection tracking state is really not required I don't understand how Katran works. How does it deal with the set of live backends changing? Using Maglev hashing gives you "minimal disruption", not no disruption.

I haven't looked at the code, but this is what they say:

> Katran uses an extended version of the Maglev hash to select the backend server. A few features of the extended hash are resilience to backend server failures, more uniform distribution of load, and the ability to set unequal weights for different backend servers.

GLB also does the same thing with caching:

> The hashing to choose the primary/secondary server is done once, up front, and is stored in a lookup table, and so doesn’t need to be recalculated on a per-flow or per-packet basis.

GLB has that primary/secondary thing which seems to be how it better handles backends coming and going.

Ok, I've looked at the code, and based on that I think what I said in my original post is correct.

GLB is more complex, but will maintain connections in more circumstances.

connection tracking could be turned off. but yeah, but default it uses it (for cases where you need to drain a host w/o distruption of existing connections). it's easy to make katran behave the same way as glb (by disabling connection tracking + writing host module which is doing more or less the same as what glb's agent is doing)

It is unfortunate that Fastly did not open source faild. That's the most elegant solution to draining that I have ever seen.

Fastly's MAC-based solution to this was actually one of the existing implementations we read about back when designing the original implementation of GLB in 2015/16, along with Facebook's IPVS-based solution. We loved the ideas behind Fastly's model, but didn't want to mess with Layer 2 to do it. GLB Director took some inspiration from both designs in the creation of L4 second chance and the L4/L7 split design.

Thanks for the details! I'd love to hear more about any data you have on the efficiency of a "second chance" design vs expanding to three or more failover servers. Very curious if a single alternate is enough to cover majority of incidents (xxx out of 1000 events?) Or how frequently you see failures that fall outside the two chance design decisions.

Keep up the great work!

there is no big difference compare to what glb is doing. they are encoding next server in mac. glb is doign the same w/ metadata inside gue

katran has example of how to use this library. which is (example) more like ipvsadm + kernel.

The Github folks could have used XDP similar as with Katran, I think their solution is very elegant in that the same node would still be able to process other workloads/jobs which is one of the reasons why Facebook decided against DPDK:


Cool use of SR-IOV, I like it. We've done a few (academic) experiments with SR-IOV for flow bifurcation and we've wondered why no one seems to use it like this. The performance was quite good: neglible performance difference between PF and a single VF and only 5-10% when running multiple >= 8 VFs (probably cache contention somewhere in our specific setup).

You seem to be running this on X540 NICs, aren't you running into limitations for the VFs. Mostly the number of queues which I believe is limited to 2 per VF in the ixgbe family. I wonder whether the AF_XDP DPDK driver could be used instead if SR-IOV isn't available or feasible for some reason.

A more detailed look at performance would have been cool. I might try it myself if I find some time (or a student) :)

We found that we could achieve 10G line rate with just the queues available to the VF, the NIC didn't seem to be a bottleneck providing DPDK was processing packets faster than line rate. It's worth noting that other traffic on the PF was/is minimal in our setup.

We tested this using DPDK pktgen on a identically-configured node (GLB Director and pktgen both using DPDK on a VF with flow bifurcation, on 2 separate machines on the same rack/switch), with GLB Director essentially acting as a reflector back to the pktgen node. pktgen was able to generate enough 40 byte TCP packets to saturate 10G with 2 TX cores/queues, and GLB Director was able to process those packets and encapsulate them with a sizeable set of binds/tables with 3 cores doing work (encapsulation) and 1 core doing RX/distribution/TX.

Yeah, 10G just isn't that much nowadays. And bigger NICs have more features in the VFs.

I've just built a quick test setup:

* two directly connected servers

* 6 core 2.4 GHz CPU

* XL710 40G NICs

* My packet generator MoonGen: https://github.com/emmericp/MoonGen with a quick & dirty modification to l3-tcp-syn-flood.lua to change dst mac

Got these results for 1-5 worker threads in Mpps: 3.84, 6.65, 10.17, 11.57, 11.3.

~10 Mpps is about 10G line rate for the encapsulated packets; this seems a little bit slower than I expected and it looks like I might be hitting the bottleneck of the distributor at 4 worker threads. Didn't look into anything in detail here (spent maybe 30 minutes for setup + tests), but we've done some VXLAN stuff in the past which I recall being faster.

This basically looks like an open source Maglev [1]. Awesome!

[1] - https://static.googleusercontent.com/media/research.google.c...

This is similar but not totally. Linux has an open source version of Maglev as a scheduler for LVS since 4.18 (to be released). GLB uses a specific consistent hashing algorithm selecting two servers and a module to let the first server redirect the flow to the second in case it doesn't know about it. This helps minimize disruption even more than with Maglev.

Looks really cool! Though a simpler solution for most people will probably be OpenBSD's CARP protocol to share a single virtual IP between multiple boxes (with for example relayd). ECMP routing can get complex fast.

CARP and the likes rely on the presence of an L2 layer which means you get limited scale or increased risk of a global outage. An L3 design scales very well and is resilient (notably because of the distributed control plane and the inability to create a loop).

Also note that CARP, VRRP, HSRP, and GLBP are all layer 3 load balancing protocols (that rely on layer 2 magic to work) whereas this is a layer 4 load balancer. Meaning I can connect to a single IP and a single port (80) and get consistently and equitably load balanced to exactly one server (of many) at another IP address.

Or VRRP with the open source keepalived, which has been around for a decade+ and works wonderfully on Linux.

That is exactly what Stack Overflow uses: keepalived to manage a virtual IP between two decent sized baremetal HAProxy servers (w/ bonded 10G nics). Works great and combined with DNS or Anycast based load balancing can scale pretty damn well. Definitely worth investigating as a KISS approach. To quote a recent Atwood tweet:

"if I have learned anything in my career, it is the shocking effectiveness of building ... literally the stupidest thing that could work. (And then iterating on it for a decade.)"


Indeed! It powers all of the internal loadbalancing (non-direct customer facing) for ticketmaster.com. I was on the core systems team ~12 or so years ago and learned all about how great it is.

Unfortunately it is not an option on some cloud providers, because you don't get true layer 2.

Off-topic: I love the GLB icon, pure genius!

I am having trouble understanding this passage. I'm wondering is someone could help me understand this as it seems like an important design detail:

>"Another benefit to using UDP is that the source port can be filled in with a per-connection hash so that they are flow within the datacenter over different paths (where ECMP is used within the datacenter), and received on different RX queues on the proxy server’s NIC (which similarly use a hash of TCP/IP header fields)."

A source port in the UDP header still needs to be be just that a port number no? Or are they actually stuffing a hash value in to that UDP header field? How would the receiving IP stack no how to understand a value other than a port number in that field?

Just a guess, but by using UDP transport for the encapsulated data and configuring the module on the proxy to accept UDP on a wide range of ports, you can pick any port you want (not just the destination port of the TCP stream in the encapsulated packet). And if you are using ECMP with a known hashing algorithm you can then use that UDP port to explicitly spread packets across the RX queues on the proxy servers (gaining better performance).

Thanks I think that might be what they mean - hash the source port for better distribution across the RX queues on the destination proxy. Cheers.

This is the first I've heard of Rendezvous hashing. It seems superior in every respect to the ring-based consistent hashing I've heard much more about. Why is the ring-based method more common?

The article states:

>"Each server has a bonded pair of network interfaces, and those interfaces are shared between DPDK and Linux on GLB director servers."

What's the distinction between DPDK and Linux here? It wasn't clear to me why SR-IOV is needed in this design. Does DPGK need to "own" the entire NIC device is that? In other words using DPDK and regular kernel networking are mutually exclusive option on the NIC? Is that correct?

Maybe I don't see the use case since I'm not at that scale, but it seems like a lot of added complexity for what appears to be hacking around using other load balancing solutions as a Layer-4 option?

Edit: It's a question, if you downvote please let me know why it's a better solution.

The goal is to avoid disruption during topology changes. If you have long-lived connections, this is important to keep them alive. The explain this a bit more here: https://github.com/github/glb-director/blob/master/docs/deve...

I have also written about this in a past article: https://vincent.bernat.im/en/blog/2018-multi-tier-loadbalanc... (which may or may not be easier to understand)

I read your link, and it's easy to understand because it's really well explained :-)

Why would one use this over HAProxy?

HAProxy is an Layer 7 (i.e. HTTP, for the most part) load balancer and only handles the use case of spreading load across multiple backends. A single instance binds to a single IP. There's no redundancy; lose the HAProxy and you lose traffic.

For true redundancy, you need a layer above that handles the distribution of traffic to multiple redundant load balancer instances, and GLB does that via ECMP (Equal-Cost Multi-Path) routing. Github supposedly uses HAProxy as their L7 load balancer.

All of this thoroughly explained in the article.

I thought best practice for HAproxy was to run two HAProxy's in parallel with VRRP or DNS load balancing?

Does that not achieve the same outcome? I've used HAProxy for layer 4 in the past without any issue this way.

It does but that setup runs into limits on throughput of individual servers, and doesn't have the same drain/fill/failover capabilities discussed in the article.

To be clear, the HAProxy+vrrp+dns is often a better solution, but this describes an interesting design for a load balancing system that can handle many orders of magnitude more traffic and have maintenance without breaking established connections (one of it's core design features)

Imagine if you need a lot more than two HAProxies and they aren't in the same rack/subnet; that's where more sophisticated techniques come in.

I have very successfully in the past and still do use HAProxy as level 4 LB. It's one of the fastest to my knowledge. I have used HAProxy as entry to big Mesos clusters without any issue before.

One example of using HAProxy as L4 LB instead of letting it do the termination is when it is proxying TLS traffic from and to multiple backends. Or Websocket. Or even as bastion LB for SSH should one bastion go down.

It's not that HAProxy doesn't do L4. As I said, projects like GLB solves how to make the load balancer itself redundant; how to load balance the load balancer, so to speak.

For the cost of a ton of added complexity though? What do you get out of this solution that other solutions don't provide? Say DNS load-balancing, VRRP, CARP, or any other HA solution.

Not for most companies, but Github's scalability requirements are pretty extensive, and really cry out for something more sophisticated than the technologies you mention.

This stuff isn't exactly new; it's essentially the Maglev system described by Google in a 2016 paper. Other companies are now catching up to Google (which is of course 2+ years ahead).

well you can't have 100% redudancy without a virtual ip or bgp. so basically glb-director is the same as just using haproxy + bgp. (bgp can basically do anycast/ecmp multipath really easily. well you still need redudant network routers.)

basically glb-director is the same as just using haproxy + bgp

BGP (really ECMP) doesn't handle failures gracefully; that's the benefit of GLB.

Wouldn't it be possible to use DNS for this, with multiple A entries per LB, a TTL of 30 or 60? And remove unhealthy servers from the list? That would even come with IPv6 support.

Then you could address the LB with an address like some-service.lb.intranet and just use that where ever you would use the original service.

Designs such as GLB can (I haven't looked deeply enough in GLB specifically to see if they can do it or not, but I would assume so) handle director level failures mid-flow, i.e. connection won't be interrupted even if one them dies (packet losses are still likely, but TCP will take care of them). That allows a lot faster recovery than solutions that depend on client's DNS settings.

Additionally DNS will leave your load balancing at the mercy of ISPs DNS server settings. At least in the past it wasn't exactly unheard of that ISPs only cached single A entry so all of their clients would be directed to single server.

That said, DNS based load balancing is generally good enough solution for most of people.

Problem here is you assume that every client honors the TTL. That is a very bad assumption to make.

DNS failover works well in practice.

DNS failover looks like a neat idea, but does not work well that good. Until a new DNS entry propagates it could take a really really long time. also using anycast/ecmp via bgp means that you have a single ip that is highly redudant because it can be backed by many servers.

>GLB Director does not replace services like haproxy and nginx, but rather is a layer in front of these services (or any TCP service) that allows them to scale across multiple physical machines without requiring each machine to have unique IP addresses.

It's right there in 2nd paragraph.

From the article:

>GLB Director does not replace services like haproxy and nginx, but rather is a layer in front of these services (or any TCP service) that allows them to scale across multiple physical machines without requiring each machine to have unique IP addresses.

Yes, I get that. But in Mesos, for instance, you could use overlay-networks to get one virtual IP per HAProxy cluster (i.e. 3 HAProxy instances). They are then round-robin'd on each server in the whole cluster so that you have a transparent LB with a singular IP.

Round-robin routing is one way that really doesn't cut it at the level of scale that Github and others operate at. Overlay networks also typically have no notion of load and how to send traffic to the hop with the least amount of load.

Use least connections for HTTP load balancing, it is load aware.

That's an HAProxy option. What do you do to load-balance HAProxy itself? That's what the article about.

Previous discussion (September 2016): https://news.ycombinator.com/item?id=12558053

This new post is about the newly released GLB Director. The title should be changed.

Will it have some special love in azure ecosystem?

Azure already offers load balancing. I'm not sure how much differentiates all the products out there now. I've never seen a load balancer be a bottleneck in any system I've worked on.

Where is the source code? Skimmed but didn’t see a link.

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact