
Andromeda 2.1 reduces GCP’s intra-zone latency by 40% - samaysharma
https://cloudplatform.googleblog.com/2017/11/Andromeda-2-1-reduces-GCPs-intra-zone-latency-by-40-percent.html
======
boulos
BTW, since the blog post doesn't make it explicit: we're down to ~40
microseconds round trip between VMs in the same zone. No placement groups or
infiniband required :).

Disclosure: I work on Google Cloud.

~~~
FBISurveillance
Is this available on GKE? Any action required (e.g. node pool flags) on our
side to get this working?

~~~
jsolson
Yes, this is enabled for all VMs running on Compute Engine (which includes GKE
VMs); however, the in-guest iptables &c. bits add non-trivial overhead (I don't
have numbers handy, apologies).

~~~
FBISurveillance
Thanks, good to know.

Regarding in-guest iptables -- there's not much we can do on GKE/Kubernetes
about that. I serve about 1.2M requests per second on my GKE cluster through
GLB, and that overhead is very noticeable.

~~~
boulos
The new Alias IP stuff should help with that (though I imagine there will
still be some iptables shenanigans left, so I'm not sure how much it will
help).
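
For GKE, the Alias IP path roughly corresponds to creating the cluster as a
VPC-native cluster. A minimal sketch (cluster name and zone are placeholders,
and the flag was still beta-ish around the time of this thread):

    gcloud container clusters create my-cluster \
        --zone us-central1-b \
        --enable-ip-alias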

~~~
FBISurveillance
AFAIK, for GLB to work with GKE (via GLBC) I would still need a NodePort
service, which routes exclusively via iptables?

FWIW I also look forward to IPVS in k8s 1.9, which should improve this slightly.

------
FBISurveillance
Why don't AWS people actively contribute to HN discussions about their
achievements?

There are 8 comments on this thread, and 3 of them are from GCP people giving
insights that aren't just marketing. Thanks @jsolson and @boulos (it's always
interesting to read your comments).

~~~
boulos
Glad you appreciate it! Jon and I both just like explaining and clarifying. In
this case he did the actual work and, along with Jake and others, deserves the
credit (I rarely contribute directly to Compute Engine these days, but I still
pretend I can explain what Jon and others do).

I don't begrudge the folks that prefer to work silently. I'm not adding the
names of any people who worked on this (and there are many!) but who perhaps
don't want to be publicly visible. I assume that's why there are fewer AWS
folks here for their launches, and it's a purely personal decision. You also
might be biased since Jon and I are particularly loud and slack off a lot at
work :).

Fwiw, please call us out if you think we're straying into Sales/Marketing.
That's not the intent, and part of why I make sure to put the disclosure on my
posts. Clearly, Cloud is a business, but I'm (still) an engineer. My goal is
that we should build an excellent product, and hopefully that convinces you or
others to use it. If it's not excellent, keep complaining until we improve!

Disclosure: I work on Google Cloud.

------
jacobn
How does this compare/relate to the AWS "enhanced networking"?

~~~
jsolson
They're different approaches to solving the same problem (improved throughput
with lower latency and jitter). The major thing they have in common is that
they both dedicate hardware to the problem.

With respect to AWS, in the historical "enhanced networking" case Amazon
dedicated hardware by offering SR-IOV capable NICs. SR-IOV is a well
understood and effective technique for approaching bare metal performance for
virtualized environments, but it tends to lock you into a particular vendor,
if not specific model, of hardware. I gather ENA does something a bit
different, but I don't know the details.

In Google's case, we dedicate hardware to the Andromeda switch in the form of
processor cores (the "SDN" block in the linked post). This allows us to be
flexible in terms of NIC hardware while presenting a uniform virtual device to
guests, in addition to simplifying universal rollout of new networking
features to all zones/instance types.

Both approaches have tradeoffs, although I _think_ even with ENA, AWS hits
~70µs typical round-trip times while GCE gets down to ~40µs. Amazon's largest
VMs in some families do advertise higher bandwidth than GCE currently does.
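
If you want to sanity-check the GCE number yourself, a rough recipe (instance
names, zone, and machine type below are just placeholders; it assumes netperf
is installed on both VMs with netserver running on the target) is the same
TCP_RR test shown further down this thread -- the "RoundTrip Latency
usec/Tran" column in the -v 2 output is the round-trip time:

    # Two VMs in the same zone, default network, no special tuning:
    gcloud compute instances create vm-a vm-b \
        --zone us-central1-b --machine-type n1-standard-16

    # From vm-a, against vm-b's *internal* IP:
    netperf -v 2 -H <vm-b-internal-ip> -t TCP_RR -l 30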

(I was the tech lead for the hypervisor side of this launch — Jake, the post's
author, leads the fast-path team for the Andromeda software switch)

~~~
_msw_
Hmmm...

      [ec2-user@ip-10-0-1-56 ~]$ sudo ping -f 10.0.1.111
      PING 10.0.1.111 (10.0.1.111) 56(84) bytes of data.
      .^C
      --- 10.0.1.111 ping statistics ---
      115480 packets transmitted, 115479 received, 0% packet loss, time 5385ms
      rtt min/avg/max/mdev = 0.037/0.039/0.226/0.008 ms, ipg/ewma 0.046/0.040 ms

~~~
_msw_
Different c5.18xlarge instances with netperf TCP_RR, no significant tuning:

      [ec2-user@ip-10-0-2-191 ~]$ netperf -v 2 -H 10.0.2.52 -t TCP_RR -l 30
      MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.2.52 () port 0 AF_INET : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate         
      bytes  Bytes  bytes    bytes   secs.    per sec   
    
      20480  87380  1        1       30.00    21178.69   
      20480  87380 
      Alignment      Offset         RoundTrip  Trans    Throughput
      Local  Remote  Local  Remote  Latency    Rate     10^6bits/s
      Send   Recv    Send   Recv    usec/Tran  per sec  Outbound   Inbound
          8      0       0      0   47.217   21178.689 0.169     0.169

~~~
jsolson
c5 wasn't available to me when I made that comment, or at least c5 numbers
weren't. We have them now, although in our runs we're observing round trips
~10 µs worse than your one-off.

It's certainly a nice improvement over what we see on the c4s. Is that using a
placement group to ensure proximity (I believe our tests do, but I'd have to
double check)? Our benchmarking philosophy is generally to aim for "default"
numbers for GCP and "best" numbers for others -- keeps us honest about our
"fresh out of the box" behavior.

Also, if we should be seeing better numbers on earlier instance types, I'd
love to know what we might be doing wrong.

------
fivesigma
What prevents the hardware offloading from being used on public interfaces?
Even if microsecond-level jitter reduction on public networks is negligible,
this should still reduce CPU load in high-PPS deployments, right?

~~~
jsolson
edit: It's late here and I think I misread this originally :)

The differential between public IPs and internal IPs is tied into the path
packets take after leaving the host. The path out of the guest is identical
for both, but using VM public IPs (rather than internal) can result in passing
through additional hops versus being routed straight to the target VM. Common
firewall configurations can also impact perf here.
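
A crude way to see the difference is to compare the same pair of VMs over
their internal and external addresses (the addresses below are placeholders,
and pinging the external IP assumes your firewall rules allow ICMP):

    ping -c 100 10.240.0.3                # internal IP: direct VM-to-VM path
    ping -c 100 <external-ip-of-same-vm>  # external IP: may traverse extra hops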

Original comment:

With respect to guest CPU, the approach used by Andromeda 2.1 eliminates VM
exits both on transmit and for interrupt delivery (where supported by Intel).
In that regard it's essentially identical to PCIe passthrough. There are
customers running DPDK to further reduce variance (and eliminate the cost of
interrupt handling entirely).

The choice to not pass through host hardware comes down to a few factors, but
high on the list are supporting live migration and NIC vendor flexibility.

(I worked on this effort; see other comments for specifics)

------
fulafel
The Andromeda description link explains that there are special OS drivers for
it. Does anyone know which part of the magic is on the guest side?

~~~
boulos
Which bit are you referring to? I think the blog posts make this a little too
abstract. It's virtio-net, and as mentioned elsewhere, we like (and contribute
to) the latest kernels for things like better multiqueue support.

For example, here's the script we install in our guest images and encourage
people to run to make sure interrupts are paired to queues:
https://github.com/GoogleCloudPlatform/compute-image-packages/blob/master/scripts/set_multiqueue
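
If you want to eyeball roughly what that script sets up, something along these
lines works from inside the guest (the interface name and IRQ number here are
assumptions for illustration):

    ethtool -l eth0                  # "Combined" row shows the virtio queue count
    grep virtio /proc/interrupts     # one input/output IRQ pair per queue
    # set_multiqueue pins each queue's IRQs to the matching vCPU by writing
    # a CPU mask, e.g. mask 0x4 pins an IRQ to vCPU 2:
    echo 4 | sudo tee /proc/irq/<queue-irq-number>/smp_affinity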

Disclosure: I work on Google Cloud but don't know much about networking.

~~~
fulafel
This bit, "Some of the most valuable enhancements enable VMs built on
supporting Linux kernels to exploit offload/multi-queue capabilities"

I was wondering what is offloaded. I guess virtio-net is a good keyword,
thanks.

~~~
jsolson
At the time Andromeda was originally introduced it took a fairly recent Linux
kernel to get support for multi-queue networking and offloads with virtio-net.
Today anything even moderately recent has support baked in -- specifically
Linux 3.8 and above include multi-queue support (as well as the offloads we
support).

In terms of specific offloads, the big ones are TCP segmentation offload (TSO)
and TCP large receive offload (LRO). These substantially reduce the compute
burden on the guest. Less impactful (although still important) are checksum
calculation and verification offload.
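
You can check which of these are active inside a guest with ethtool (the
interface name is an assumption, and the exact feature set varies by kernel):

    ethtool -k eth0 | egrep 'segmentation|receive-offload|checksum'
    # tcp-segmentation-offload and the checksum offloads should report "on";
    # receive-side coalescing may show up as GRO rather than classic LRO.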

(I was the tech lead for the hypervisor side of this launch — Jake, the post's
author, leads the fast-path team for the Andromeda software switch)

------
polskibus
Is this improvement also available in standard Linux? If not, can it be ported
to benefit all Linux VMs?

~~~
wmf
It sounds like this feature is akin to vhost, which has been available in
Linux for a few years. Using vhost-net and OVS, or vhost-user and VPP, you
could build something similar to Andromeda.
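
As a very rough sketch of the DIY flavor (vhost-net on a tap port attached to
an existing OVS bridge; the bridge, tap, and image names are placeholders, and
all of the actual SDN flow programming is omitted):

    ip tuntap add dev vnet0 mode tap
    ovs-vsctl add-port br0 vnet0
    qemu-system-x86_64 -enable-kvm -m 2048 \
      -netdev tap,id=net0,ifname=vnet0,script=no,downscript=no,vhost=on,queues=4 \
      -device virtio-net-pci,netdev=net0,mq=on,vectors=10 \
      disk.img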

~~~
jsolson
It's similar, although distinct. By building on a common foundation of Google
networking dataplane bits, Jake's team (and peer teams) get easier integration
with Google's other networking infrastructure for features like DoS
protection, encryption, etc. The core bits underlying Andromeda 2.1 are
related to those used for Espresso
(https://www.blog.google/topics/google-cloud/making-google-cloud-faster-more-available-and-cost-effective-extending-sdn-public-internet-espresso/
-- HN discussion: https://news.ycombinator.com/item?id=14037830).

------
YuriGrinshteyn
Congratulations, y'all!

