
Google shares software network load balancer design powering GCP networking - rey12rey
https://cloudplatform.googleblog.com/2016/03/Google-shares-software-network-load-balancer-design-powering-GCP-networking.html?m=1
======
teraflop
The consistent hashing stuff in the paper is pretty cool. In order to
distribute traffic among backends, they came up with a new "Maglev hashing"
algorithm that gives a more even distribution than existing techniques, but is
less robust to changes in the set of servers. The trick is that you can
locally memoize the results of your mostly-consistent hash function, and rely
on an extra layer of connection affinity from the upstream ECMP routers to the
load-balancers, so that the same memoized value is used every time. So you
never actually drop connections unless _both_ a backend instance and a load
balancer fail at the same time, which should be very rare. Clever!
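
For the curious, the table-building part is simple enough to sketch. This is
my own toy reimplementation of the idea described in the paper (section 3.4),
not Google's code; the table size, hash choice, and names are placeholders:

    import hashlib

    M = 65537  # lookup table size; the paper suggests a prime much larger than the backend count

    def _h(name, salt):
        return int(hashlib.md5((salt + name).encode()).hexdigest(), 16)

    def build_maglev_table(backends, M=M):
        """Each backend claims its next preferred slot in turn until the table is full."""
        offsets = [_h(b, "offset") % M for b in backends]
        skips = [_h(b, "skip") % (M - 1) + 1 for b in backends]
        table, filled = [-1] * M, 0
        next_idx = [0] * len(backends)  # how far each backend has walked its permutation
        while True:
            for i in range(len(backends)):
                # walk backend i's permutation until an unclaimed slot turns up
                while True:
                    slot = (offsets[i] + next_idx[i] * skips[i]) % M
                    next_idx[i] += 1
                    if table[slot] < 0:
                        table[slot] = i
                        filled += 1
                        break
                if filled == M:
                    return table

    def pick_backend(table, backends, five_tuple):
        return backends[table[hash(five_tuple) % len(table)]]

Rebuilding the table after removing one backend only reshuffles a small
fraction of slots, and the per-LB connection table papers over even that.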

As an aside, I couldn't help noticing these lines on adjacent pages:

> Maglev has been serving Google’s traffic since 2008. It has sustained the
> rapid global growth of Google services, and it also provides network load
> balancing for Google Cloud Platform.

> Maglev handles both IPv4 and IPv6 traffic, and all the discussion below
> applies equally to both.

In that case, any chance GCE will be getting IPv6 support soon? ;)

~~~
justincormack
Google uses IPv6 internally for its services (no IPv4 at all?) but apparently
doesn't see fit to let the rest of us use it. Very annoying.

~~~
mikecb
Source?

~~~
novaleaf
As a GCloud user I've never seen any mention of IPv6. That's the main reason
I've not bothered trying to get on the IPv6 bandwagon. I don't want to struggle
through it if my hosting provider doesn't even support it.

~~~
mikecb
CloudSQL instances support IPv6,[1] and App Engine apps can receive IPv6
traffic, but the only suggestion regarding the delay in getting IPv6 to users
was hardware limitations.[2] I wanted to know why the OP thinks this is only
for external customers.

[1] [https://cloudplatform.googleblog.com/2014/11/cloudsql-instan...](https://cloudplatform.googleblog.com/2014/11/cloudsql-instances-now-come-with-free-ipiv6-address.html)

[2] [https://code.google.com/p/google-compute-engine/issues/detai...](https://code.google.com/p/google-compute-engine/issues/detail?id=8)

~~~
novaleaf
Cool, thanks for the extra details. Maybe they are using IPv6 for a portion of
the internal infrastructure and the OP extrapolated that to mean they use it
everywhere internally.

------
gtaylor
I don't need to tell you how much of an ecosystem exploded after the BigTable
whitepapers. It might not be realistic to expect that kind of reaction for
load balancing tech, but it's great to see them continue to do this sort of
thing. Also looking forward to seeing what this release in particular may
inspire.

~~~
jauer
ECMP is pretty old-hat. Example from 2007:
[https://www.nanog.org/meetings/nanog41/presentations/Kapela-...](https://www.nanog.org/meetings/nanog41/presentations/Kapela-lightning.pdf)

Exactly how the routes get steered to servers can vary (static routes with
track objects, OSPF or BGP on the server), but the basic idea of using network
gear to ECMP traffic between servers is, if not ancient, pretty common.

I suspect there's something novel in the control plane or in the hashing used
to steer ECMP, but I didn't see anything about that in the article.

~~~
transitorykris
ECMP is old; we've been pulling this stunt for a long time, for the right
applications (e.g., UDP, short-lived TCP). Adding/removing routes in ECMP can
cause flows to hash to a new destination. The connection tracking and
consistent hashing in Maglev are important to deal with faults and avoid
disrupting unrelated flows. Still nothing new.
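
To put a number on the "unrelated flows" problem: with a plain hash-mod-N,
shrinking the backend set remaps most flows, which is exactly what connection
tracking plus consistent hashing avoids. A throwaway illustration (my own
sketch, nothing to do with Maglev's actual code):

    import random

    def hash_mod(flow, backends):
        return backends[hash(flow) % len(backends)]

    flows = [("10.0.%d.%d" % (random.randrange(256), random.randrange(256)),
              random.randrange(1024, 65535)) for _ in range(10000)]

    before = {f: hash_mod(f, ["b0", "b1", "b2", "b3"]) for f in flows}
    after = {f: hash_mod(f, ["b0", "b1", "b2"]) for f in flows}  # one backend removed
    moved = sum(1 for f in flows if before[f] != after[f])
    print("flows remapped by plain hash-mod-N: %.0f%%" % (100.0 * moved / len(flows)))

    # A per-LB connection table memoizes the first answer, so an established flow
    # only moves if the LB holding its entry disappears as well.
    conn_table = {}
    def lookup(flow, backends):
        if flow not in conn_table:
            conn_table[flow] = hash_mod(flow, backends)
        return conn_table[flow]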

~~~
kijiki
We (Cumulus Networks) support a feature called "resilient hashing" that
ensures that if a software LB fails, only the flows going to that LB are
redistributed to the remaining LBs.

You still lose some connections when an LB fails, but only the ones going
through the failed LB. Unrelated flows to other LBs are not impacted.

We've got multiple customers doing variants of the LB architecture Google
talks about here.
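
Mechanically, the idea behind resilient hashing (my simplified sketch here, not
our actual implementation) is a fixed-size bucket table: each ECMP bucket
points at a next-hop, and when one dies only its buckets get rewritten, so
flows hashing into the surviving LBs' buckets never move:

    BUCKETS = 256  # fixed bucket count, independent of how many LBs are alive

    def init_table(nexthops):
        # spread the next-hops round-robin over a fixed number of buckets
        return [nexthops[i % len(nexthops)] for i in range(BUCKETS)]

    def remove_nexthop(table, dead, alive):
        # rewrite only the dead next-hop's buckets; everything else is untouched
        return [alive[i % len(alive)] if nh == dead else nh
                for i, nh in enumerate(table)]

    table = init_table(["lb1", "lb2", "lb3", "lb4"])
    table = remove_nexthop(table, "lb4", ["lb1", "lb2", "lb3"])
    # packets are steered by hashing the 5-tuple into a bucket:
    #   table[hash(five_tuple) % BUCKETS]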

~~~
newman314
I didn't know Cumulus had a LB product. Any more details that you can share?

~~~
wmf
You can turn any switch into a ghetto load balancer by running BGP on some
hosts and advertising a /32 into your switches.
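
A common way to wire that up (a rough sketch; the VIP, the health check, and
the ExaBGP setup here are all assumptions, not anything from the paper) is to
run ExaBGP on each LB host with a small health-check process that announces or
withdraws the service /32 over its text API:

    #!/usr/bin/env python
    # Hypothetical ExaBGP health-check process: announce the service VIP while
    # the local load balancer is healthy, withdraw it otherwise.
    import socket, sys, time

    VIP = "192.0.2.1/32"         # assumed service VIP
    CHECK = ("127.0.0.1", 8080)  # assumed local health-check endpoint

    def healthy():
        try:
            socket.create_connection(CHECK, timeout=1).close()
            return True
        except OSError:
            return False

    announced = False
    while True:
        up = healthy()
        if up != announced:
            verb = "announce" if up else "withdraw"
            sys.stdout.write("%s route %s next-hop self\n" % (verb, VIP))
            sys.stdout.flush()
            announced = up
        time.sleep(2)

The switches then ECMP across however many hosts are currently announcing the
/32, which is all the "load balancing" you get at this layer.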

~~~
bogomipz
You want BGP on your edge routers, where your transit connections are; those
are what do the ECMP towards your LBs, which speak iBGP. I'm not sure where you
would use a cheap switch in this setup, certainly not at the edge.

~~~
wmf
Many recent data centers run eBGP between all switches. I can't explain the
rest without diagrams.

~~~
bogomipz
Huh? Then you must not understand what you are talking about very well. Are
you talking about running BGP to ToR switches? What does that have to do with
ECMP or load balancers? It's just an L3 design, and it's not new.

------
matt_wulfeck
> This is possible because Google clusters, via Maglev, are already handling
> traffic at Google scale. [...] Google’s Maglevs, and our team of Site
> Reliability Engineers who manage them, have that covered for you. You can
> focus on building an awesome experience for your users, knowing that when
> your traffic ramps up, we’ve got your back.

This is probably one of the most compelling arguments I've heard: "It will work
well because we both rely on it working well. We have skin in the game."

------
ivan_ah
Great blog post! Thoughtful explanations and links to research papers[1] are a
much classier way to compete and differentiate than just saying "we're
better, we're better" like they've done before[2].

I love the jab at AWS at the end:

> 'knowing that when your traffic ramps up, we’ve got your back.'

They've heard that AWS load balancers need "prewarming" to handle huge traffic
spikes, so they're throwing one in there...

[1] [http://research.google.com/pubs/pub44824.html](http://research.google.com/pubs/pub44824.html)

[2] [https://news.ycombinator.com/item?id=10287318](https://news.ycombinator.com/item?id=10287318)

------
_wmd
The paper talks a lot about pushing Linux out of the way in the receive path
(including to avoid copies), only to have a demultiplexing thread steer
packets once they arrive in userspace. Linux has had PACKET_MMAP for a long
time, and modern Linux has RPS, which would seem to duplicate exactly this
functionality without the copies or the single point of contention (one thread
running on one core, vs. N interrupt queues directly waking threads pinned on
N cores). Can anyone comment on why they aren't using it?

> Because the NIC is no longer the bottleneck, this figure shows the upper
> bound of Maglev throughput with the current hardware, which is slightly
> higher than 15Mpps. In fact, the bottleneck here is the Maglev steering
> module, which will be our focus of optimization when we switch to 40Gbps
> NICs in the future

Sounds like they could use RPS

~~~
asuffield
(Tedious disclaimer: my opinion only, not speaking for anybody else. I'm an
SRE at Google, and I'm oncall for this service.)

On the understanding that I can't tell you any more than what it already says
in the paper, the key detail that you're looking for is section 4.1.2:

""" Maglev originally used the Linux kernel network stack for packet
processing. It had to interact with the NIC using kernel sockets, which
brought significant overhead to packet processing including hardware and
software interrupts, context switches and system calls [26]. Each packet also
had to be copied from kernel to userspace and back again, which incurred
additional overhead. Maglev does not require a TCP/IP stack, but only needs to
find a proper backend for each packet and encapsulate it using GRE. Therefore
we lost no functionality and greatly improved performance when we introduced
the kernel bypass mechanism – the throughput of each Maglev machine is
improved by more than a factor of five. """

My understanding is that RPS just steers packets into the Linux TCP stack.
Nothing on the maglev machine wants to process the TCP layer; that workload
can be distributed over the backends.
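
As a mental model of that fast path (strictly my own sketch of what the paper
describes, not Google's code): each packet thread checks a connection table,
falls back to the consistent-hash lookup table, and wraps the original packet
in GRE towards the chosen backend; no TCP state is ever touched on the Maglev
itself.

    import struct

    conn_table = {}    # (src, dst, sport, dport, proto) -> backend index
    lookup_table = []  # filled by the Maglev hashing step, maps hash % M -> backend index

    def select_backend(five_tuple):
        be = conn_table.get(five_tuple)
        if be is None:
            be = lookup_table[hash(five_tuple) % len(lookup_table)]
            conn_table[five_tuple] = be
        return be

    def gre_encap(outer_src, outer_dst, inner_packet):
        # outer_src/outer_dst are packed 4-byte addresses (socket.inet_aton()).
        # Minimal GRE: 2 bytes of flags/version (all zero) + protocol 0x0800 (IPv4).
        gre = struct.pack("!HH", 0, 0x0800)
        total_len = 20 + len(gre) + len(inner_packet)
        # Outer IPv4 header: version/IHL, TOS, length, id, flags/frag, TTL,
        # protocol 47 (GRE), checksum left at zero for brevity.
        ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, total_len, 0, 0, 64, 47, 0,
                         outer_src, outer_dst)
        return ip + gre + inner_packet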

------
jhgg
This is awesome! At work, we're a heavy user of google's network load balancer
to route persistent TCP websocket connections to our haproxy cluster. It's
worked flawlessly. Glad to see the design behind this released!

------
_wmd
> to build clusters at scale, we want to avoid the need to place Maglev
> machines in the same layer-2 domain as the router, so hardware encapsulators
> are deployed behind the router, which tunnel packets from routers to Maglev
> machines

Is "hardware encapsulator" just a fancy way of saying they're using tunnel
interfaces on the routers, or is a "hardware encapsulator" an actual thing?

~~~
coops
It's a real piece of hardware.

~~~
_wmd
Commercially available?

~~~
illumin8
Yes, it's just a 10GbE ethernet switch that can encapsulate the traffic in
VXLAN headers, so that it can traverse east/west between any of thousands
(millions?) of hypervisors without requiring traffic to hairpin to a gateway
router and back. The logical networks all exist in an overlay network, so to
the customer VMs, you get L2/L3 isolation. But, to the underlying hypervisors,
they actually know which vNICs are running on each hypervisor in the cluster,
so they can talk directly on a large connected underlay network at 10GbE (x2)
line rate.

This is the standard way of distributing traffic in large datacenters. That
way you get extremely fast, non-blocking line rate between any two physical
hosts in the datacenter, and since the physical hosts know which
VMs/containers are running on them, they can pass the traffic directly to the
other host if VMs exist in the same L2 network, and even do virtual routing if
the VMs exist across L3 boundaries - still a single east/west hop.
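
For reference, the VXLAN encapsulation itself is tiny; roughly (a sketch of
the RFC 7348 header layout, with the VNI picked arbitrarily):

    import struct

    VXLAN_UDP_PORT = 4789

    def vxlan_header(vni):
        # 8-byte VXLAN header: flags byte with the I bit set (0x08), three
        # reserved bytes, the 24-bit VNI, and one trailing reserved byte.
        return struct.pack("!B3xI", 0x08, vni << 8)

    # The encapsulator then wraps the original frame as:
    #   outer IP / UDP(dport=4789) / vxlan_header(vni) / original Ethernet frame
    assert len(vxlan_header(5001)) == 8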

~~~
_wmd
So it was a mystery-inducing way of referring to some commodity SDN-related
tech, thanks, that's far more informative than the paper :)

~~~
eisenbud
Fair point, we should just have taken the opportunity to say SDN here.

------
deegles
"There's enough headroom available to add another million requests per second
without bringing up new Maglevs"

How awesome is that?

------
jpgvm
Using ECMP to balance the traffic across the load balancers is definitely the
way to go. You can couple this with a dynamic routing protocol (BGP/OSPF/ISIS)
to achieve fast failure detection of load-balancers too.

We used this strategy at a previous company I worked at, deploying large
OpenStack-based clouds; it worked really well. Though if you want it to be easy
on the backends, you need to use some sort of stick-table or consistent-hashing
mechanism, as described in the paper.

------
hankerfaster
ECMP has been used for well over a decade. The DPI industry has been doing
this entire load balancing thing for at least 15 years.

10gbps with 8 cores is pretty crappy performance. People using DPDK and netmap
are getting 10gbps with a single core. Netmap can bridge at line rate for 64
byte packets with a single core.

I am actually surprised this is 'news'. Perhaps google should reach out to
some of the DPI companies to figure out how to really scale this. Currently
they are 'overpaying' on commodity hardware by at least 8x over the current
cutting edge in this type of field.

Just have a look at the NFV world for the cutting-edge packet forwarding rates.

~~~
Swannie
The paper points out that they can do better than 10Gbps, and when looking at
40Gbps, their limitation is their traffic steering piece.

And as I'm sure you're aware, bridging is a totally different use case.

------
bigR
Interesting read. I am most fascinated by their kernel bypass mechanism for
packet transmission. They claim that their system requires proper support from
the NIC hardware, does not involve the Linux kernel in packet transmission,
does not require kernel modification, and allows them to switch between NICs.
Also, if I'm reading the claims in the related work section correctly, it
seems this solution is architecturally different from netmap/pf_ring or
DPDK/OpenOnload.

------
ryancox
Is this at all related to
[https://github.com/google/seesaw](https://github.com/google/seesaw) ?

------
DanielDent
The ring buffer approach reminds me of the architecture used by the LMAX
Disruptor.

