The issue is small-packet performance, and you can isolate it pretty easily with netperf TCP_RR. In order to send a packet, the hypervisor needs to switch from the guest to the hypervisor and, in the case of Xen, on to domain-0. These world switches are very expensive.
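For reference, a run that reproduces this looks something like the following (the host address is a placeholder, and netserver needs to already be running on the target):

```shell
# Request/response test against a guest at 10.0.0.2 (placeholder),
# using 64-byte requests and responses. The result is reported in
# transactions/sec and is dominated by per-packet world-switch cost,
# not bandwidth.
netperf -H 10.0.0.2 -t TCP_RR -- -r 64,64
```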
Normally you don't notice this because almost all I/O in hypervisors goes through a data structure known as a lockless ring queue. These structures are extremely efficient at batching requests so as to minimize the overhead of world switching, ideally avoiding it altogether.
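The batching trick is notification suppression: the producer only kicks the other side (i.e. triggers a world switch) when the consumer has declared itself idle. A toy Python model of the idea (not any particular hypervisor's implementation):

```python
from collections import deque

class RingQueue:
    """Toy model of a paravirtual I/O ring: the producer (guest) only
    'kicks' (forces a world switch) when the consumer (host) has said
    it is idle, so a burst of requests costs one switch, not one each."""

    def __init__(self):
        self.ring = deque()
        self.consumer_needs_kick = True  # consumer starts out idle
        self.kicks = 0                   # stand-in for world switches

    def produce(self, item):
        self.ring.append(item)
        if self.consumer_needs_kick:
            self.kicks += 1              # wake the consumer: one switch
            self.consumer_needs_kick = False

    def consume_all(self):
        items = []
        while self.ring:
            items.append(self.ring.popleft())
        self.consumer_needs_kick = True  # going idle: re-arm notification
        return items

q = RingQueue()
for i in range(1000):    # a streaming burst of 1000 packets
    q.produce(i)
q.consume_all()
print(q.kicks)           # -> 1: one world switch for the whole burst
```

Under streaming load the amortized cost per packet goes to almost nothing, which is why bulk-transfer benchmarks look fine.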
But TCP_RR is the pathological case for this. Because each side waits for the other's reply, there is never more than one packet in flight to batch, so no matter how smart the data structure is you still end up taking at least one exit per packet. And with small packets, that means multiple world switches to move a very small number of bytes (usually around 64).
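Extending the same toy model to a ping-pong workload shows why suppression can't help here: every message finds the peer idle, so every message pays for a switch.

```python
class Side:
    """One direction of a notification-suppressing ring: a kick
    (world switch) happens only when the peer has gone idle."""

    def __init__(self):
        self.pending = []
        self.peer_idle = True
        self.kicks = 0

    def send(self, msg):
        self.pending.append(msg)
        if self.peer_idle:
            self.kicks += 1        # peer asleep: world switch to wake it
            self.peer_idle = False

    def drain(self):
        msgs, self.pending = self.pending, []
        self.peer_idle = True      # nothing left: go idle, re-arm kicks
        return msgs

tx, rx = Side(), Side()     # guest->host and host->guest rings
for _ in range(1000):       # 1000 request/response transactions
    tx.send(b"x" * 64)      # request: ring was empty, host idle -> kick
    tx.drain()              # host processes the lone request
    rx.send(b"x" * 64)      # response: guest idle -> kick the other way
    rx.drain()              # guest processes the lone response
print(tx.kicks + rx.kicks)  # -> 2000: two world switches per transaction
```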
There are ways to improve this (using things like adaptive polling), but it's still an area of active development. I don't follow Xen too closely anymore, but we've got quite a few new things in KVM that help here, and I would expect dramatic improvements in the near future.
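The adaptive-polling idea, very roughly: after draining, the consumer spins for a small budget hoping more work shows up before it goes back to sleep, trading some CPU for far fewer world switches under steady load. A hypothetical discrete-time sketch (the budget/gap parameters are made up for illustration):

```python
def world_switches(poll_budget, gap, packets):
    """Count kicks when a packet arrives every `gap` ticks and the
    consumer polls for `poll_budget` ticks after the ring goes empty."""
    armed = True         # notifications enabled; producer kicks if set
    polling_left = 0
    kicks = 0
    for t in range(packets * gap):
        if t % gap == 0:                 # a packet arrives
            if armed:
                kicks += 1               # consumer was asleep: kick it
                armed = False
            polling_left = poll_budget   # consume it, then start polling
            if polling_left == 0:
                armed = True             # no polling: sleep immediately
        elif polling_left > 0:
            polling_left -= 1
            if polling_left == 0:
                armed = True             # budget spent: sleep, re-arm
    return kicks

print(world_switches(poll_budget=0, gap=10, packets=100))   # -> 100
print(world_switches(poll_budget=10, gap=10, packets=100))  # -> 1
```

With no polling every packet pays for a switch; once the poll budget covers the inter-packet gap, only the first one does.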