
How to achieve low latency with 10Gbps Ethernet - jgrahamc
https://blog.cloudflare.com/how-to-achieve-low-latency/
======
thrownaway2424
There has been a spate of articles lately that sum up to one fact: out of the
box, all Linux distributions are tuned for nothing in particular. With some
configs you can reduce your latency by an order of magnitude, increase your
MySQL ops by an order of magnitude, etc. This is one reason why I, as an
application-level programmer, am glad to work at a company big enough to have
someone else dealing with the platform on my behalf. I don't want to spend ten
years slowly discovering traps like default window size tuned for T1 speed,
default min_rto tuned for interplanetary traffic, not enough PIDs to support a
96-core machine, etc.

~~~
rdtsc
> all Linux distributions are tuned for nothing in particular.

Which is good, because tuning often means making trade-offs. A lot of those
things, like polling and interrupt shuffling, have side effects and might, for
example, negatively impact other scenarios -- presumably not everyone out there
is running a single client-server ping-pong with small UDP packets.

Yes, a lot of parameters have retained default values from back in the day when
system characteristics were different. Switching defaults drastically might
break existing code that relied on them. Perhaps those are better updated
during major distro version upgrades (RHEL 6->7, for example).

In general though, I would say Linux is tuned for average throughput, not
latency. Sometimes they go hand in hand, sometimes they don't. Over the years
there have been patches and forks aimed at low-latency environments (stock
trading, media processing, hardware control, desktop graphics). Those range
from parameter tweaks in /proc, to kernel patches, to almost full kernel
bypass via things like DPDK for Ethernet.

~~~
thrownaway2424
Sure, they are all tradeoffs of some kind or another. It's crazy to think that
one person deploying a Linux box on AWS would be aware of them all, though.
For example, what is the default file-max on Ubuntu? Why?

~~~
rdtsc
Well, you have different use cases and a large panel of knobs and switches you
can adjust. You need someone who knows what the knobs do and what effect they
have.

So perhaps there is a use case for a wizard-like configuration tool for
intermediate users. It could ask you questions like: how many clients do you
have? How many concurrent connections? Would you rather have higher average
throughput (say, for downloading files) or lower packet latency (say, for
audio processing)? Would you rather use your memory for a file cache or for
your application data? And so on.

Next step up -- a self-tuning system. It monitors your usage pattern, then
decides what settings to use. Monitors again, adjusts a bit, and so on.

~~~
gshx
I've always been somewhat wary of self-tuning systems. The challenge is that
the runtime profile of a well-utilized machine is not always predictable,
especially if it is multi-tenant and/or running a mixture of different
workloads in a container-like abstraction. I think one level higher is easier
to manage and understand: treating multiple machines, containers, or VMs as
the processing units. As a whole, that seems more manageable as a self-tuning
system.

~~~
rdtsc
Tuning profiles could be suggested and a human would apply them. But yeah,
that was more of a pie-in-the-sky idea.

------
ghshephard
It's an absolutely wonderful article - perhaps one of the best I've seen
regarding latency tweaking on a Linux stack - but I don't understand why the
final step, _"Since we have a Solarflare network card handy, we can use the
OpenOnload kernel bypass technology to skip the kernel network stack all
together:"_, got so little discussion. That was the major punch line of the
entire article, and it was given a minor "oh, by the way" sort of treatment.

Great article regardless. I'm wondering what the latency would have been if
they had _started_ with that step.

~~~
thrownaway2424
What are the tradeoffs? Is there any downside to this userspace network stack,
or is it pure win?

Edit: other than needing the thousand-dollar NIC, I mean.

~~~
pjc50
It's open-source but patent-encumbered, which is a strange state. It's not a
user-space TCP stack but a "card-space" one: it's offloading all the work into
the card.

~~~
neomantra
It's not offloading ALL the work onto the card. With 'offload' people often
think of features like computing checksums in hardware, but that's not a big
deal on modern hardware. OpenOnload definitely leverages the card's
capabilities, notably packet buffer handling, which is exposed in their ef_vi
API (there's a comment about this on the article's page); the flow steering
mentioned in the article is a big help too. OpenOnload is a user-space network
stack built on top of ef_vi.

But so much of the win comes from being userspace and tunable. OpenOnload also
accelerates pipes, loopback UDP/TCP, and epoll, all of which are unrelated to
their hardware. Indeed, having unaccelerated fds in the same epoll set as
accelerated fds kills performance, because it needs to ask the kernel for the
status of the unaccelerated fds.
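
As a rough illustration of that last point (plain epoll, nothing
OpenOnload-specific; as I understand it, onload accelerates sockets it
intercepts via LD_PRELOAD, while a pipe stays with the kernel), a set like
this forces a kernel trip on every wait just for the pipe fd:

    /* Sketch: an epoll set mixing a (potentially onload-accelerated) TCP
       socket with an ordinary pipe.  Under a kernel-bypass stack the pipe
       fd still has to be checked via the kernel on every epoll_wait. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void) {
        int ep = epoll_create1(0);

        int pipefd[2];
        pipe(pipefd);                                 /* never accelerated */

        int lsock = socket(AF_INET, SOCK_STREAM, 0);  /* accelerated under onload */
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(9000);
        bind(lsock, (struct sockaddr *)&addr, sizeof(addr));
        listen(lsock, 16);

        struct epoll_event ev;
        memset(&ev, 0, sizeof(ev));
        ev.events = EPOLLIN;
        ev.data.fd = pipefd[0];
        epoll_ctl(ep, EPOLL_CTL_ADD, pipefd[0], &ev);
        ev.data.fd = lsock;
        epoll_ctl(ep, EPOLL_CTL_ADD, lsock, &ev);

        struct epoll_event events[8];
        int n = epoll_wait(ep, events, 8, 1000);      /* 1 s timeout */
        printf("%d fd(s) ready\n", n);
        return 0;
    }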

That said, I think you need to have a SolarFlare card for OpenOnload to work
at all, even if you just use it for local communication. I've not tried it
though.

------
StillBored
NUMA-pin the DMA buffers too; I didn't see that in the article. With the app I
worked on, getting the PCI bus/CPU/memory/interrupt all pinned to the "right"
node gave us a good 30% performance improvement.

Put another way: identify which socket the PCIe bridge and device are attached
to, then make sure the DMA buffers and interrupt handler are pinned to that
node.
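
A rough user-space sketch of that, assuming libnuma and the sysfs numa_node
attribute (the PCI address is made up, and the driver's own ring allocations
plus /proc/irq/*/smp_affinity still have to be handled separately):

    /* Sketch: find which NUMA node a NIC hangs off and keep the consuming
       thread and its buffers on that node.  Build with -lnuma. */
    #include <stdio.h>
    #include <numa.h>

    int main(void) {
        int node = 0;
        /* hypothetical PCI address -- look yours up with lspci */
        FILE *f = fopen("/sys/bus/pci/devices/0000:41:00.0/numa_node", "r");
        if (f) { if (fscanf(f, "%d", &node) != 1) node = 0; fclose(f); }
        if (node < 0) node = 0;       /* -1 means no affinity reported */

        if (numa_available() < 0) return 1;
        numa_run_on_node(node);       /* pin this thread to the NIC's socket */
        numa_set_preferred(node);     /* future allocations land there too */

        void *buf = numa_alloc_onnode(1 << 20, node);  /* 1 MiB local buffer */
        printf("NIC node %d, buffer at %p\n", node, buf);
        numa_free(buf, 1 << 20);
        return 0;
    }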

Also, don't touch the buffers from the other socket. Strangely enough,
touching the DMA buffers from the other node just once was enough to cause a
performance fall-off.

~~~
monocasa
Probably because the cache line changes from Exclusive to Owned in MOESI.

~~~
StillBored
Yah, that is the obvious choice, but it wasn't as clear-cut, especially with
the behavior we were seeing, where the performance problems would remain long
after the lines were flushed from the CPU caches. I spent a fair amount of
time looking at Intel docs on snoop filters and the like and never really
satisfied myself that I knew exactly what was causing the problem.

------
hpcadmin007
10 GbE is often good enough for most use cases. For same-datacenter,
same-floor gear, if there is a pressing requirement for low latency, deploying
1-2 generations-old InfiniBand (IB) gear is likely a smarter approach: the
cost point approaches 10 GbE and the latency is sub-microsecond (0.5-0.7 us vs
20-30 us for 10 GbE, a factor of 40-60x). Deploying OFED can be a pain unless
you use a distro like Rocks; don't mix and match different vendors' gear
(stick to all Mellanox or all QLogic), and be sure the specific BIOS versions
of everything are supported. For Rocks cluster support/deployment, we've used
the cool & helpful folks over at StackIQ (formerly Cluster Corp).

------
wtallis
Really important point: this is measuring baseline latency of sending one
packet at a time. They're not measuring latency under load where queueing
delays are possible. The picture there usually looks pretty different and some
of the settings they're changing here may be harmful or irrelevant to that
case.

~~~
brianwawok
Also why upstream smart batching is so important. If you only have one packet,
shovel it off. But maybe wait 1-2 microseconds and see if you can combine 10
packets (up to the MTU) to decrease overall latency.

~~~
wtallis
It depends somewhat on what layer you're talking about. Nagle's algorithm
makes a lot of sense in a lot of situations. But if you've got a complete
packet that's ready to encapsulate in an ethernet frame and send over the
wire, you only want to try batching if you're actually hitting a CPU limit
from interrupt frequency or something like that -- unless you're dealing with
WiFi, where extreme amounts of packet aggregation are necessary to get close
to the theoretical throughput.
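
(For reference, the per-socket switch for Nagle at the TCP layer is
TCP_NODELAY; a minimal sketch, not from the article:)

    /* Sketch: disable Nagle on one TCP socket so small writes go out
       immediately instead of being coalesced -- the classic latency vs.
       wire-efficiency trade-off. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
            perror("TCP_NODELAY");
        /* ... connect() and send small messages without Nagle delays ... */
        return 0;
    }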

~~~
brianwawok
Well, if you use smart batching a la
[http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html](http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html),
you don't really take the same kind of hit that you would with Nagle's or a
more simple-minded batching algo.
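
Roughly, the drain-and-send loop looks like this (queue_drain is a
hypothetical stand-in for a real inbound queue; the point is that the batch is
whatever piled up while the previous write was in flight, with no deliberate
waiting):

    /* Sketch of the drain-and-send loop behind "smart batching": one
       writev() per pass covers however many messages accumulated while the
       previous write was in flight. */
    #include <stddef.h>
    #include <sys/uio.h>

    #define MAX_BATCH 64

    struct msg { void *data; size_t len; };

    /* hypothetical: pop up to max pending messages, return how many */
    extern int queue_drain(struct msg *out, int max);

    void sender_loop(int sock_fd) {
        struct msg batch[MAX_BATCH];
        struct iovec iov[MAX_BATCH];

        for (;;) {
            int n = queue_drain(batch, MAX_BATCH);
            if (n == 0)
                continue;                 /* a real loop would park/poll here */

            for (int i = 0; i < n; i++) {
                iov[i].iov_base = batch[i].data;
                iov[i].iov_len  = batch[i].len;
            }
            /* one syscall amortized over the whole batch */
            writev(sock_fd, iov, n);
        }
    }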

------
fulafel
The extreme measures (bypassing the kernel, etc.) allowed them to shave ~20 us
off the latency, which corresponds to a few kilometers of cable. So this could
indeed make a difference for latency-bound apps talking to each other inside a
data center.

------
rasz_pl
step one should be 'we move drivers to userspace and bypass the kernel
completely'

edit: finished reading, it's the last step :/ and it roughly halves the best
case they were able to achieve with tweaks.

~~~
baruch
That's still not easy enough to do. UDP is simple enough, but doing TCP in
those drivers is still a pain. There's DPDK with lwIP, or with a BSD stack
tacked on top, but so far I haven't found something that I can just take for a
ride in that area.

For the extreme cases you really want to control packet allocation on the
receive side and have the buffers your app uses also be the ones used for
network transmit, so you get real zero-copy.
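
For what it's worth, the rx/tx half of that looks roughly like this with DPDK
(a trimmed sketch: error handling, NIC capability checks, and a proper
rte_eth_conf are omitted, and port 0 is assumed):

    /* Sketch (DPDK): receive a burst and hand the very same mbufs back to
       the TX queue -- the receive buffers are the transmit buffers, which
       is the zero-copy property described above. */
    #include <string.h>
    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST 32

    int main(int argc, char **argv) {
        rte_eal_init(argc, argv);

        struct rte_mempool *pool = rte_pktmbuf_pool_create(
            "pkt_pool", 8191, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

        struct rte_eth_conf conf;
        memset(&conf, 0, sizeof(conf));
        rte_eth_dev_configure(0, 1, 1, &conf);      /* port 0, 1 rx + 1 tx queue */
        rte_eth_rx_queue_setup(0, 0, 512, rte_eth_dev_socket_id(0), NULL, pool);
        rte_eth_tx_queue_setup(0, 0, 512, rte_eth_dev_socket_id(0), NULL);
        rte_eth_dev_start(0);

        struct rte_mbuf *bufs[BURST];
        for (;;) {
            uint16_t n = rte_eth_rx_burst(0, 0, bufs, BURST);  /* poll, no interrupts */
            if (n == 0)
                continue;
            uint16_t sent = rte_eth_tx_burst(0, 0, bufs, n);   /* same buffers, no copy */
            while (sent < n)
                rte_pktmbuf_free(bufs[sent++]);                /* drop what didn't fit */
        }
        return 0;
    }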

