The issue is small-packet performance. You can isolate it pretty easily with netperf TCP_RR. In order to send a packet, the hypervisor needs to switch from the guest to the hypervisor and, in the case of Xen, to domain-0. These switches are very expensive.
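For reference, a minimal sketch of such a test (the host address and run length are placeholders; this assumes netperf is installed on the client and netserver is running on the machine under test):

```shell
# On the server under test (one-time):
#   netserver
# From the client, run a request/response test with 1-byte payloads;
# 10.0.0.2 is a placeholder for the server's address.
netperf -H 10.0.0.2 -t TCP_RR -l 60 -- -r 1,1
# The reported transaction rate is effectively an upper bound on
# round trips (and thus world switches) per second.
```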
Normally, you don't notice this because almost all I/O in hypervisors is done with a data structure known as a lockless ring queue. These data structures are extremely efficient at batching requests in a way that minimizes the overhead of world switching, by avoiding switches whenever possible.
But TCP_RR is the pathological test case for this. No matter how smart the data structure is, you still end up taking one exit per packet. In particular, with small packets, you've got multiple world switches to move a very small number of bytes (usually around 64).
There are ways to improve this (using things like adaptive polling), but this is still an area of active development. I don't follow Xen too closely anymore, but we've got quite a few new things in KVM that help with this, and I would expect dramatic improvements in the near future.
- EC2 will dynamically adjust your CPU share as you try to use it, so you will _not_ get consistent results over any short period.
- EC2 is subject to other people's loads, which may be IO, CPU or network bound.
Going from there:
- Xen is slower any time you need dom0/domU coordination - it wouldn't surprise me to learn that there's some sort of coordination happening in accept() to tag the session through the upper dom0 firewalls.
- You don't describe what your backlog is on the listening socket, but you should make sure you're accepting as many as you can during your CPU share on EC2 -- your slice _will_ be interrupted at inopportune times.
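On the backlog point, note that Linux silently caps whatever backlog you pass to listen() at net.core.somaxconn, which has historically defaulted to 128. A sketch of checking and raising it (the sysctl names are standard; the values here are just examples, and writes need root):

```shell
# The effective accept-queue length is min(listen() backlog, somaxconn).
sysctl net.core.somaxconn
sysctl -w net.core.somaxconn=1024
# The half-open (SYN) queue is sized separately:
sysctl -w net.ipv4.tcp_max_syn_backlog=4096
```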
Finally, EC2 is _lousy_ performance-wise, especially w/r/t disk IO - it doesn't sound like it's the case here, but if you're logging to disk after accept(), this could be the killer.
Tangentially -- you _might_ get better accept() performance if you turn ON syncookies, as then the handshake occurs basically at the kernel, and the accept() is only relevant _after_ the handshake is done. It's a bit hacky, but with _large_ numbers of connections, it can improve your performance a bit.
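If you want to try it, enabling syncookies is a one-liner (as root; persist it in /etc/sysctl.conf to survive reboots):

```shell
# Enable SYN cookies; takes effect immediately.
sysctl -w net.ipv4.tcp_syncookies=1
# To persist, add to /etc/sysctl.conf:
#   net.ipv4.tcp_syncookies = 1
# Note: with this setting cookies only kick in once the SYN queue
# overflows, so it mainly changes behavior under heavy connection load.
```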
Thanks for the suggestions, here are some clarifications:
* I clarified what hardware I've tested in a comment (see below).
* I've run tests with my server ranging from 10 sec up to 10 minutes. They're consistently bad unfortunately.
* Interesting what you say about dom0/domU. I'm no Xen guru, but the culprit is probably something like that. I've been using a backlog of 1024 for the server tests (set both in Java land and in sysctl.conf). The netperf runs are all defaults, both in terms of run time and backlog. I was actually trying to monitor the backlog somehow, but I'm not sure that's even possible in Linux?
* The server isn't doing anything disk IO-bound so this shouldn't be the case.
* syncookies seems like a good idea; I will definitely try that along with a much bigger backlog and see if it makes any difference. I'll also see if netperf can be tweaked to provide a better, isolated test case.
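For what it's worth, monitoring the backlog is possible on Linux, at least with iproute2's ss. A sketch:

```shell
# For listening TCP sockets, ss shows the accept queue:
#   Recv-Q = connections completed but not yet accept()ed
#   Send-Q = the configured backlog limit
ss -ltn
# Cumulative overflow counters (listen queue overflows, dropped SYNs):
netstat -s | grep -i listen
```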
Writing this off as Xen overhead would be such a shame; virtualization should not cost this much. I'll continue investigating!
OTOH, what you will be trading off with syncookies is that they subtly violate the TCP standard, and this will especially be a problem with "server talks first" protocols (like [E]SMTP and SSH): if the third ACK of the three-way handshake gets lost on the way from the client to the server, a canonical TCP implementation would retransmit the SYN-ACK from the server side. Except that the whole point of syncookies is not to keep state on the server side, i.e. there is nothing that can retransmit the SYN-ACK at all.
HTTP, being by nature a "client talks first" protocol, hides this problem (the ACK for the SYN-ACK is effectively retransmitted because it is part of the first data segment), but I thought it might be useful to point out that turning syncookies on by default is not the standard modus operandi.
* On EC2 I've tried c1.xlarge and the giant cc1.4xlarge. With cc1.4xlarge, I saw maybe a ~10% increase in accept() rate.
* Two separate, virtualized servers at the office.
* A private, virtualized server on Rackspace was briefly tested as well.
A compilation of netperf results is available at https://gist.github.com/985475
Change the interrupt CPU affinity to split network interrupts over multiple cores. See:
I tried enabling RPS/RFS, which, to my understanding, does exactly this: load-balances interrupt handling among multiple cores. With it enabled, I saw little to no difference in connection rate.
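For comparison, here is a sketch of both approaches (the IRQ number, device name eth0, queue rx-0, and CPU bitmasks are all placeholders; writes need root):

```shell
# Manual IRQ affinity: pin the NIC's interrupt to CPU0.
# Find the IRQ number for your NIC in /proc/interrupts; 24 is an example.
echo 1 > /proc/irq/24/smp_affinity
# RPS instead steers packet processing in software: a bitmask of 'f'
# spreads queue rx-0 of eth0 across CPUs 0-3.
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
```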
But then again, I'm no guru, so I might as well double-check this.
Updated my little "action plan" in the original Serverfault question with this info.
This raises a few interesting cases in which the behavior of irqbalance may be non-intuitive, most notably cases in which a system has only one cache domain. Nominally these are single-CPU systems, but this can also occur in multi-core environments in which the cores share an L2 cache. In these situations irqbalance will exit immediately, since there is no work it can do to improve interrupt-handling performance.
There is really not enough information in the post to diagnose, though, although someone might recognise the situation.
For FreeBSD, you should also check if there are packet drops in sysctl net.inet.ip.intr_queue_drops.
Our most amusing result was watching a t1.micro beat the pants off a cc1.4xlarge.
For better or for worse (most would say better), segment-limit checking is disabled in 64-bit mode, so 64-bit paravirtual Xen has to use the MMU to protect its monitor. Basically, both the kernel and userspace actually run in ring 3, but on different page tables. This means that expensive MMU updates (and TLB flushes) are required both on the way into and out of the kernel for every syscall.