

Why is TCP accept() performance so bad under Xen? - DanWaterworth
http://serverfault.com/questions/272483/why-is-tcp-accept-performance-so-bad-under-xen

======
aliguori
This is a very well-known issue (at least among virtualization developers
:-)). ESX does a surprisingly good job handling this but both Xen and KVM are
still trying to catch up here.

The issue is small packet performance. You can isolate it pretty easily with
netperf TCP_RR. In order to send a packet, the hypervisor needs to switch from
the guest, to the hypervisor, and in the case of Xen, to domain-0. These
switches are very expensive.

Normally, you don't notice this because almost all I/O in hypervisors is done
with a data structure known as a lockless ring-queue. These data structures
are extremely efficient at batching requests in such a way as to minimize the
overhead of world switching by trying to never do it.
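For intuition, here is a minimal single-producer/single-consumer ring queue sketched in Python (an illustrative toy, not Xen's actual shared-ring code): the point is that one drain pass -- one "world switch" -- services every request queued since the last pass.

```python
# Illustrative sketch (not Xen's real implementation): a single-producer/
# single-consumer ring queue. The consumer drains everything pending in
# one pass, so a single "world switch" can service many queued requests.
class RingQueue:
    def __init__(self, size=256):
        self.buf = [None] * size
        self.size = size
        self.head = 0  # next slot the producer writes
        self.tail = 0  # next slot the consumer reads

    def push(self, item):
        nxt = (self.head + 1) % self.size
        if nxt == self.tail:
            raise BufferError("ring full")
        self.buf[self.head] = item
        self.head = nxt

    def drain(self):
        # One pass collects every request queued since the last drain.
        batch = []
        while self.tail != self.head:
            batch.append(self.buf[self.tail])
            self.tail = (self.tail + 1) % self.size
        return batch

q = RingQueue()
for i in range(10):       # guest queues ten I/O requests...
    q.push(i)
print(len(q.drain()))     # ...one drain (one "switch") handles all ten: prints 10
```

With bulk transfers the producer keeps the ring full, so the cost of each switch is amortized over many requests; TCP_RR defeats exactly this amortization.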

But TCP_RR is the pathological test case for this. No matter how smart the
data structure is, you still end up taking one exit per packet. In particular,
with small packets, you've got multiple world switches to move a very small
number of bytes (usually around 64).
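To reproduce, a netperf invocation along these lines isolates the small-packet round-trip cost (assuming netserver is running on the guest; the hostname is a placeholder):

```shell
# One 64-byte request/response per round trip, for 30 seconds;
# every transaction forces guest -> hypervisor (-> dom0) switches.
netperf -H guest.example.com -t TCP_RR -l 30 -- -r 64,64
```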

There are ways to improve this (using things like adaptive polling) but this
is still an area of active development. I don't follow Xen too closely any
more but we've got quite a few new things in KVM that help with this and I
would expect dramatic improvements in the short term future.

------
abofh
First, you should test on an unloaded domU, not anything in EC2:

* EC2 will dynamically adjust your CPU share as you try and use it, so you will _not_ get consistent results over any short period.

* EC2 is subject to other people's loads, which may be IO, CPU or network bound.

Going from there:

* Xen is slower any time you need dom0/domU coordination - it wouldn't surprise me to learn that there's some sort of coordination happening in accept() to tag the session through the upper dom0 firewalls.

* You don't describe what your backlog is on the listening socket, but you should make sure you're accepting as many as you can during your CPU share on EC2 -- your slice _will_ be interrupted at inopportune times.

Finally, EC2 is _lousy_ performance-wise, especially w/r/t disk IO - it
doesn't sound like it, but if you're logging to disk after accept(), this
could be the killer.

Tangentially -- you _might_ get better accept() performance if you turn ON
syncookies, as then the handshake occurs basically at the kernel, and the
accept() is only relevant _after_ the handshake is done. It's a bit hacky, but
with _large_ numbers of connections, it can improve your performance a bit.
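On Linux that toggle is a one-liner; raising the backlog ceiling is a separate, related knob (the 1024 value below is just a suggestion matching the backlog discussed here):

```shell
# Enable SYN cookies; persist the setting in /etc/sysctl.conf.
sysctl -w net.ipv4.tcp_syncookies=1
# Optionally raise the kernel's cap on listen() backlogs as well.
sysctl -w net.core.somaxconn=1024
```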

~~~
cgbystrom
(I wrote the Serverfault question)

Thanks for the suggestions; here are some clarifications:

* I clarified what hardware I've tested in a comment (see below).

* I've run tests with my server ranging from 10 sec up to 10 minutes. They're consistently bad unfortunately.

* Interesting what you say about dom0/domU. I'm no Xen guru, but the culprit is probably something like that. I've been using a backlog of 1024 for the server tests (set both in Java land and in sysctl.conf). The netperf runs are all defaults, both in terms of run time and backlog. I was actually trying to monitor the backlog somehow, but I'm not sure that's even possible in Linux?

* The server isn't doing anything disk IO-bound so this shouldn't be the case.

* syncookies seems like a good idea, I will definitely try that along with a lot bigger backlog and see if it makes any difference. I'll also try and see if netperf can be tweaked as well to provide a better, isolated test case.
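On the backlog-monitoring point above: on Linux, ss can show it for LISTEN sockets -- Recv-Q is the current accept-queue depth and Send-Q the configured backlog (the port 8080 below is a placeholder):

```shell
# Show listening TCP sockets with accept-queue depth (Recv-Q)
# and configured backlog (Send-Q) for a given port.
ss -ltn 'sport = :8080'
```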

Writing this off as Xen overhead would be such a shame, virtualization should
not cause this much overhead. I'll continue investigating!

~~~
ay
A nit re. syncookies: the TCP three-way handshake _always_ occurs in the
kernel. You can grab the kernel source; the relevant stuff is in
net/ipv4/tcp_input.c. What syncookies _may_ somewhat help you with is if you
have a very large number of half-open (SYN_RCVD state) connections - and even
then, the data structures for those are supposed to be efficient enough for
this not to be a problem.

OTOH, what you _will_ be trading off with syncookies is that they subtly
violate the TCP standard, and this will be especially problematic with
"server talks first" connections (like [E]SMTP, SSH): if your third ACK of the
three way handshake gets lost on the way from client to the server, the
canonical TCP implementation would have retransmitted the SYNACK from the
server side. Except that the whole point of the syncookies is not to keep the
state on the server side, i.e. there is nothing that can at all retransmit the
SYNACK.

HTTP, being by nature a "client talks first" protocol, hides this problem
(the ACK for the SYNACK will be effectively retransmitted because it will be
part of the data segment), but I thought it might be useful to note that
turning syncookies on by default is not the standard modus operandi.

------
adamt
The fact that the load is high on one CPU suggests to me that all interrupts
from the NIC are going to just one CPU (generally the default on Linux). An
8-core EC2 machine has lots of total CPU, but individual cores are not that
fast.

Change the interrupt cpu affinity to split network interrupts over multiple
cores. See:

    http://www.cs.uwaterloo.ca/~brecht/servers/apic/SMP-affinity.txt
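The mechanics boil down to writing a CPU bitmask into /proc (the IRQ number 24 below is hypothetical; look up your NIC's actual IRQ in /proc/interrupts first):

```shell
# Find the NIC's IRQ number.
grep eth0 /proc/interrupts
# Pin that IRQ to CPU1 (bitmask 2 = binary 10 = CPU1 only).
echo 2 > /proc/irq/24/smp_affinity
```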

~~~
nodata
Won't irqbalance do this?

~~~
ciupicri
Unfortunately irqbalance has some limitations on certain multi-core CPUs. On
my Core 2 Duo 6400 it doesn't do anything. I also found this in the man
page:

 _This raises a few interesting cases in which the behavior of irqbalance may
be non-intuitive. Most notably, cases in which a system has only one cache
domain. Nominally these systems are only single cpu environments, but can also
be found in multi-core environments in which the cores share an L2 cache. In
these situations irqbalance will exit immediately, since there is no work that
irqbalance can do which will improve interrupt handling performance._

------
justincormack
The clue seems to be the very high CPU load on one CPU under Xen. You need to
do some digging to see what it is. I would look at the interrupts under Xen
and see if they are not being balanced. The config of the network interfaces
and which drivers are being used are key.

There is really not enough information in the post to diagnose this, but
someone might recognise the situation.

------
mike_esspe
You should check netstat -s for listen queue overflows and syncache bucket
overflows.

On FreeBSD you should also check for packet drops in sysctl
net.inet.ip.intr_queue_drops
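Concretely, the checks look something like this (the exact counter wording varies a bit between kernel versions):

```shell
# Linux: accept-queue and SYN-queue drop counters.
netstat -s | grep -i -e 'listen' -e 'overflow'
# FreeBSD: IP input-queue drops.
sysctl net.inet.ip.intr_queue_drops
```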

~~~
cgbystrom
Thanks, noted. Updated the "action plan" in my question at Serverfault.

------
nathanhammond
Carl, I can confirm that this issue exists. We have seen the exact same
performance characteristics on EC2, regardless of instance type. The same
performance characteristics apply even if the instance is a type which would
be singly-hosted on the host hardware. And in spite of days of effort, nothing
I could do seemed to "tune" it out.

Our most amusing result was watching a t1.micro beat the pants off a
cc.4xlarge.

------
icehawk
It would be interesting to compare accept() performance between different
hypervisors to see what the performance hit is with KVM, VMware, etc.

------
kqueue
Just a hypothesis: accept() triggers a context switch. Context switches
involve MMU operations, and these can be slower under Xen if they are
emulated in software (type 2 hypervisor). That is much slower than a native
OS doing a context switch, where the MMU operations are executed directly by
the CPU.

~~~
kijiki
Real hardware (and 32bit paravirtualized Xen) don't require any MMU changes
during a syscall. This is because on real hardware, there is no VMM (Virtual
Machine Monitor) to protect, and on 32bit paravirt, Xen can use x86 segments
to protect the monitor.

For better or for worse (most would say better), segment limit checking is
disabled in 64bit mode, so 64bit paravirtual Xen has to use the MMU to protect
its monitor. Basically, both the kernel and userspace actually run in ring3,
but on different page tables. This means that expensive MMU updates (and TLB
flushes) are required both on the way into and out of the kernel for every
syscall.

~~~
aliguori
I'd be surprised if it was 64-bit. 64-bit PV Xen is awfully slow because of
exactly what you reference. HVM would be quite a bit better.

~~~
kijiki
All the EC2 instance types mentioned in the article are 64 bit PV.

