
Low latency tuning guide - ingve
https://rigtorp.se/low-latency-guide/
======
hrgiger
I have a hobby project whose goal followed a similar learning path. If you also
work on your own server, don't forget the software side: perf
([http://brendangregg.com/perf.html](http://brendangregg.com/perf.html)) is a
godsend, not just for the kernel side but for your own software as well. As
part of my build I was always checking the command below:

perf stat -e task-clock,cycles,instructions,cache-references,cache-misses,branches,branch-misses,faults,minor-faults,cs,migrations -r 3 nice taskset 0x01 ./myApplication -j XXX

Additions I would have I have benefited:

* I use the latest trimmed kernel, with no encryption, extra devices, etc.

* You might want to check out RT kernels; the finance & trading folks are always a good guide if you can follow them

* I removed all the boilerplate app stack from Linux, or built a small one; I'm even considering getting rid of the network stack for my personal use

* Disable hyper-threading: I had short-lived workers and this didn't help in my case; you might want to validate first which setting is best suited to your needs

* Also check your CPU capabilities (e.g. with AVX2 & quad-channel memory I get great improvements) and test them to make sure

* A system like this gets hot quickly, so watch the temps. Even short-running tests might give you happy results, but in the long term the temps easily hit the wall, and the BIOS won't care, it will just throttle
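
The throttling point above is easy to watch for from the shell. A minimal sketch using the sysfs counters present on most Intel systems (the cpu0 paths are an assumption; adjust per core, and availability varies by platform):

```shell
# Times core 0 was thermally throttled since boot; a rising count
# during a long benchmark means your results are frequency-limited.
cat /sys/devices/system/cpu/cpu0/thermal_throttle/core_throttle_count

# Current vs. maximum frequency: a widening gap under sustained load
# suggests thermal or power throttling.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
```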

~~~
zarathustreal
Please don’t take offense at this, my friend; it is genuinely constructive
criticism. Slow down a little bit and re-read what you’re typing. I can’t
understand half of what you’ve written here because it is so poorly done. It
is a shame, because I feel like you’re trying to share interesting information;
it’s just extremely hard to parse whatever it is you’re trying to say.

~~~
drivebycomment
Interesting. I found the comment rather succinct and to the point. Maybe you
are not the target audience.

~~~
zarathustreal
That might be true, maybe you could help me understand what this means:

“Additions I would have I have benefited”

~~~
drivebycomment
I am afraid that this looks like nitpicking. Even if you can't imagine what
the original intent might be behind that sentence from the sentence alone, the
contents of the bullet points make it clear what information they represent.
So even if that sentence were dropped entirely, I wouldn't have had any
difficulty understanding what the comment is about and what useful information
it contains.

------
bob1029
I know it's not the exact same kind of concern as presented here, but I have
recently found that one technique for achieving extremely precise timing of
execution is to just sacrifice an entire high priority thread to a busy wait
loop that checks timing conditions as fast as the CPU will cycle instructions.
This has the most obvious advantage of being trivial to implement, even in
high level languages that only expose the most basic of threading primitives.

Thinking strategically about this approach, modern server CPUs expose upwards
of 64/128 threads, so if 1 of these had to be sacrificed completely to the
gods of time, you are only looking at 1-2% of your overall resources spent for
this objective. Then, you could reuse this timing service for sequencing work
against the other 98-99% of resources. Going back just a few years, throwing
away 12/25/50% of your compute resources for the sake of precise timing would
have been a non-starter.

For reference, I find that this achieves timing errors in the 100-1000
nanosecond range in my .NET Core projects when checking a trivial # of events.
I have not bothered to optimize for a large # of events yet, but I believe
this will just be a pre-processing concern with an ordered queue of future
events. I have found that this timing is precise enough to avoid the sort of
logic you would otherwise need to measure timing error and compensate for it
on future iterations (e.g. in non-deterministic physics simulations or frame
timing loops).

~~~
rigtorp
Yes, definitely turn off HT/SMT and use a single app thread per core with busy
waiting. I'm working on a low latency application design guide exploring this
more in depth.

~~~
bob1029
I haven't measured this yet, but I question whether SMT would actually
introduce any meaningful jitter into the timing loop. If my event is off by
10-100 nanoseconds, I probably don't care that much.

I am not actually sure how precise this approach could be in theory, so if the
noise floor could be low enough for it to matter, then it's certainly a factor
for some applications.

If we prove that SMT has a certain bounded impact, then it may be possible to
say that for a certain range of applications you get a 2x feasibility bump
because you can leave SMT enabled.

~~~
R0b0t1
It shouldn't, that's the whole reason SMT exists. If there is detectable
jitter that would be notable.

People have a bad taste in their mouths left over from circa 2000(?) by some
Intel parts with a pipeline that was too deep. Ever since that was fixed, most
workloads do see a 2x speedup when enabling SMT.

~~~
rigtorp
SMT sibling threads can definitely impact each other. It works great for
common workloads. If you have a highly tuned workload with high IPC or want to
trade off throughput for latency, disabling SMT can be a win. Disabling SMT
also increases effective L1 and L2 cache which can be beneficial.

------
Torkel
On application side I recommend using an instrumenting profiler that will let
you know down to sub-microseconds what the code is doing. Tracy is a good
choice ([https://github.com/wolfpld/tracy](https://github.com/wolfpld/tracy))
but there are others, e.g. Telemetry.

~~~
jeffbee
I can't tell what Tracy does because the documentation is so poor. Check out
XRay for an older but still actively developed function tracing tool that
generates traces which can be viewed in the Chrome trace viewer.

[https://llvm.org/docs/XRayExample.html#debugging-with-
xray](https://llvm.org/docs/XRayExample.html#debugging-with-xray)

~~~
Torkel
Did you see the PDF manual? I would actually have preferred something online
for docs, but there is no lack of documentation for Tracy.

------
sild
This page [1] describes additional tuning parameters. In particular adjusting
/proc/sys/kernel/sched_rt_runtime_us can be beneficial.

[1] [https://access.redhat.com/documentation/en-
us/red_hat_enterp...](https://access.redhat.com/documentation/en-
us/red_hat_enterprise_linux_for_real_time/7/html/tuning_guide/real_time_throttling)

~~~
rigtorp
For the lowest-latency applications I avoid using RT priorities. It's better to
run each core at 100% with busy waiting, and if you do that with RT priority
you can prevent the kernel from running tasks such as vmstat, leading to
lockup issues.
Out of the box there is currently no way to 100% isolate cores in Linux. There
is some ongoing work on that:
[https://lwn.net/Articles/816298/](https://lwn.net/Articles/816298/)

------
jeffbee
One setting that seems rather important but is missing is NIC receive
interrupt coalescing. The feature delays frames to reduce the number of
interrupts, thus increasing latency. Usually you want to turn this down as far
as possible, but don’t set it to “1” because many NICs interpret that setting
to mean “use a black-box adaptive algorithm” and you don’t want that either.

It’s also quite helpful to run your Rx soft interrupts on the core that’s
receiving the packet, but flow steering isn’t mentioned.

~~~
rigtorp
For a truly lowest-latency software application you need to avoid all context
switches. Interrupt-driven IO adds too much overhead; you need to use polling
and busy waiting. I'm working on a guide for this type of application design.
If you are indeed using the Linux network stack, then yes, adjusting NIC
interrupt coalescing and interrupt affinity is useful.

~~~
jeffbee
True, but you have to rewrite your application to switch to a receive queue
polling model, whereas "tuning" is stuff you can get for free with your
existing program.

~~~
rigtorp
I've always been using kernel bypass for low latency networking. You can also
use SO_BUSY_POLL with the Linux stack. I should at least mention this in the
guide.

------
PhDuck
This is a cool article in the sense that it gives an idea of the tuning that
can be done at one extreme. While most applications won't need this level of
tuning, and some might even be hurt by it if they aren't CPU bound, it is
great to know which options exist.

Does anyone have further articles exploring the topic of OS tuning for various
types of applications? Maybe also for other OSes, BSD/Windows?

~~~
rigtorp
I have another article on virtual memory: [https://rigtorp.se/virtual-
memory/](https://rigtorp.se/virtual-memory/)

------
WJW
You know you're doing low level stuff when the guide mentions things like

> Also consider using older CPU microcode without the microcode mitigations
> for CPU vulnerabilities.

I don't think I even know where to find older microcode for my particular CPU.

~~~
wtallis
Using older microcode is just a matter of preventing your OS from uploading
newer microcode during the boot process, and not updating the motherboard
firmware to a newer version that bundles newer CPU microcode. Rolling back the
motherboard firmware is usually not a supported option, and sometimes is
actively prohibited by the system.

~~~
WJW
So how would I acquire a processor without the mitigations at this point?
Presumably all the new ones are sold with them already installed.

~~~
wtallis
CPUs do not have non-volatile memory for storing microcode updates. It has to
be uploaded from the motherboard during the boot process, and the OS can
optionally upload a newer version as part of its boot process. So the
difficult thing is finding a matching _motherboard_ that's running an old
enough firmware version, or that can be rolled back to an older firmware
version.
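
To stay on the firmware's (older) microcode you mainly need to stop the OS-side update; a sketch, assuming a standard early-load setup:

```shell
# See which microcode revision each core is currently running.
grep microcode /proc/cpuinfo | sort -u

# The kernel parameter dis_ucode_ldr disables the OS microcode
# loader, leaving whatever revision the firmware uploaded.
# Confirm no "microcode updated" line after booting with it:
dmesg | grep -i microcode
```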

------
ncmncm
> _Don’t create any file backed writable memory mappings_

If you have to create a writable file-backed memory mapping, open it in
/dev/shm or /dev/hugepages. You can mount your own hugetlbfs volume with the
right permissions.

Creating a big-ass single-writer ring buffer in a series of mapped hugetlb
pages is a great way to give downstream code some breathing room. You can have
multiple downstream programs, each on its own core, map it read-only, and
start and stop them independently. Maintain a map of checkpoints farther back
in the ring buffer, and they can pick up again, somewhere back, without
missing anything.

~~~
rigtorp
I don't consider that file backed, since those mappings pull memory from the
same pool as anonymous memory and not the page cache. The Linux kernel docs
make the distinction between file-backed and anonymous memory. I think a
better term would probably be page-cache-backed memory vs anonymous memory.

------
fit2rule
Would love to see a version of this but for ARM64 .. I'm assuming a few of
these tips will be applicable, but I bet there's some ARM-specific things to
be learned.

~~~
rigtorp
Most of these tips apply to all architectures. The only x86 specific parts are
regarding CPU power management and turboboost.

------
angry_octet
This is a good list, but it seems to blur latency and jitter. For example,
turbo modes can cause significant variability, threads running on other cores
can cause your core to downclock, etc.

Reducing kernel scheduler interrupt rate can cause strange delay effects
(presumably due to locking). Running it faster, but only on some cores, has
been more beneficial IME. Depends on the latency vs jitter requirement of your
computation I guess. If you're using SCHED_FIFO there is a complicated
interaction with ticking being sometimes enabled (while theoretically
tickless) at 1kHz to let kernel threads run...

Multithreaded apps should consider cache contention and CPU/memory placement.
This does not always mean place all cores on the same socket, because you
might need to get max memory bandwidth. Cf lstopo, numad/numactl,
set_mempolicy(2). Making sure no other process can thrash the L3 or use memory
bandwidth on your real-time socket can also help. Note that numad(8) does page
migrations, so it can cause jitter when they happen, but also reduce jitter
in steady state.

~~~
rigtorp
With the right cooling setup I've been able to get Xeons to run permanently in
turbo mode, kind of a back door overclock. You would have to experiment.

For the lowest-latency applications I avoid using RT priorities. It's better to
run each core at 100% with busy waiting, and if you do that with RT priority
you can prevent the kernel from running tasks such as vmstat, leading to
lockup issues.
Out of the box there is currently no way to 100% isolate cores in Linux. There
is some ongoing work on that:
[https://lwn.net/Articles/816298/](https://lwn.net/Articles/816298/)

Oh yeah, the whole NUMA page migration stuff is not well documented. You'll
also have proactive compaction to deal with in the future:
[https://nitingupta.dev/post/proactive-
compaction/](https://nitingupta.dev/post/proactive-compaction/)

~~~
angry_octet
If you can make sure something else won't emit watts (other cores, something
using AVX) then yeah turbo can work.

You can patch the kernel thread creation to not use specific CPUs... IDK of
anyone publicly maintaining a patchset, but look at
[https://github.com/torvalds/linux/blob/master/kernel/kthread...](https://github.com/torvalds/linux/blob/master/kernel/kthread.c)
line 386.

We've only been using SCHED_FIFO on Linux, which is basically busywaiting if
you only have one thread scheduled. Though I am interested in trying the
SCHED_DEADLINE.

I hope someone adds a kcompact processor mask / process state disabling
compaction.

~~~
rigtorp
There was also this recent patch
[https://lwn.net/Articles/816211/](https://lwn.net/Articles/816211/) to deal
with kthread affinities. Even with isolcpus I find I still need to run pgrep
-P 2 | xargs -i taskset -p -c 0 {} and deal with the workqueues.

Have you tried "A full task-isolation mode for the kernel":
[https://lwn.net/Articles/816298/](https://lwn.net/Articles/816298/) ?

Running only a single thread per core, I see no difference between SCHED_FIFO
vs SCHED_OTHER. Except SCHED_FIFO can cause lockups if running 100% since
cores are not completely isolated (ie vmstat timer and some other stuff).

Yes, it's annoying you cannot disable compaction. There is also work on pro-
active compaction now: [https://nitingupta.dev/post/proactive-
compaction/](https://nitingupta.dev/post/proactive-compaction/)
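
The kthread and workqueue cleanup mentioned above can be scripted roughly like this (core 0 as the housekeeping core is an assumption; per-cpu kthreads will refuse the affinity change and taskset will report an error for them):

```shell
# Move all children of kthreadd (PID 2) to housekeeping core 0.
pgrep -P 2 | xargs -i taskset -p -c 0 {}

# Restrict unbound workqueues to core 0 (value is a cpumask, so 1 = cpu0).
echo 1 > /sys/devices/virtual/workqueue/cpumask
```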

~~~
angry_octet
I like the Tosatti/WindRiver/Lameter patch. (Except the naming: _possible has
the same meaning as _available, but here they mean different things depending
on whether kthreads or user threads are involved?) It just needs a proc
interface.

With a patch like this you can force bottom half (interrupts), top half
(kthreads) and user threads to all be on different cores.

The 'full task-isolation mode' seems weird, because why should you drop out of
isolation due to something outside your control like paging or the TLB?
Anyway, mlockall that. It's fine to be told, I guess (except signals take
time), but why drop out of isolation and risk glitches while re-isolating? It
doesn't seem very polished.

Something else occurred to me: you still have to be careful about data flow and
having enough allocatable memory. E.g., a lot of memory local to core 0 will
be consumed by the buffer cache; it can be beneficial to drop it (free; numastat
-ms; sync; echo 1 > /proc/sys/vm/drop_caches; free; numastat -ms).

I find it bizarre that all this THP compaction stuff is for workloads that are
commonly run under a virtual machine, i.e. another layer of indirection.

~~~
rigtorp
As I understand 'full task-isolation mode' will prevent compaction, completely
disable vmstat timer etc. So it provides additional isolation. Since you
already switched into kernel mode, might as well deliver a signal to let you
know it happened. If the signal is masked there should be no overhead at all
except a branch to check the signal mask.

------
fizwhiz
_> The term latency in this context refers to ... The time between a request
was submitted to a queue and the worker thread finished processing the
request._

Since this is a tuning guide, I would have liked to see a separation of 3
attributes:

* Service-time: The actual time taken by the thread/process once it _begins_ processing a request.

* Latency: The time spent by the request waiting to get processed (ex: languishing in a queue on the client/server side). This is when the request was _latent_.

* Response time: A combination of Service-time + Latency as recorded by the server. From a client's POV, this would additionally include the overhead from the network media etc.

Most performance models seem to isolate these separately to get a deeper sense
of where the bottlenecks are. When there's just a single queue for everything
then it makes sense to make the service-time as short as possible. But if you
have multiple workload-based queues then you can do more interesting things.

~~~
rigtorp
This guide pretty much tells you how to make the Linux kernel interfere as
little as possible with your application. How to instrument and what to
measure would depend on the application.

I agree that measuring queuing delay and processing delay separately makes
sense.

------
nisa
threadirqs cmdline option might also make a difference:

threadirqs Force threading of all interrupt handlers except those marked
explicitly IRQF_NO_THREAD.

It helped with my bluetooth issues and it's recommended for low-latency audio
setups, but unfortunately I lack the knowledge about the tradeoffs. You also
probably need to assign a higher priority to the IRQ threads:
[https://alsa.opensrc.org/Rtirq](https://alsa.opensrc.org/Rtirq) \- not sure
if it's applicable outside audio.
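
With threadirqs on the kernel command line, each handler shows up as an `irq/N-name` kernel thread whose priority can be raised with chrt; a sketch (the `snd` pattern is a placeholder for whatever device matters to you):

```shell
# List threaded IRQ handlers and their current RT priorities.
ps -eo pid,rtprio,comm | grep 'irq/'

# Raise e.g. the sound card's IRQ thread to SCHED_FIFO priority 85.
chrt -f -p 85 "$(pgrep -f 'irq/.*snd')"
```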

~~~
mrob
Hint for web designers: subtle drop-shadow on text is a great way to simulate
astigmatism. Thankfully, Firefox's Reader View simulates eyeglasses.

~~~
gavinray
TIL I have astigmatism from an HN comment. Now the headlights + traffic lights
thing at night makes sense too...

------
halz
Another technique to dynamically manipulate task isolation is through the
kernel's 'cpuset' interface ([https://www.kernel.org/doc/html/latest/admin-
guide/cgroup-v1...](https://www.kernel.org/doc/html/latest/admin-
guide/cgroup-v1/cpusets.html)). Coupled with threadirqs and RCU 'kthread-
ification' with rcu_nocbs, this can provide a large amount of flexibility in
organizing real-time workloads. The userspace interface to cpusets via
'cpuset' makes it even more accessible ([https://documentation.suse.com/sle-
rt/15-SP1/html/SLE-RT-all...](https://documentation.suse.com/sle-
rt/15-SP1/html/SLE-RT-all/cha-shielding-model.html))
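
With the cset frontend, the shielding model described in the SUSE docs boils down to a couple of commands (cores 2-3 are placeholders, and the exact --exec syntax may differ between cset versions):

```shell
# Shield cores 2-3: ordinary tasks, and kernel threads where movable,
# are confined to the remaining cores.
cset shield --cpu=2-3 --kthread=on

# Run the latency-critical application inside the shield.
cset shield --exec ./myApplication
```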

------
drudru11
This is an excellent article. It is comprehensive and has simple explanations.

------
mike00632
Great info.

You might also want to look into or write about DPDK, which achieves further
speed-ups using polling mode drivers (instead of interrupt) and having the
application directly process packets from the NIC (bypassing the kernel, which
can be a bottleneck).

[https://en.wikipedia.org/wiki/Data_Plane_Development_Kit](https://en.wikipedia.org/wiki/Data_Plane_Development_Kit)

------
bogomipz
The author states:

>"Hyper-threading (HT) or Simultaneous multithreading (SMT) is a technology to
maximize processor resource usage for workloads with low instructions per
cycle (IPC)"

I had actually not heard this before regarding SMT. What is the significance
of low IPC with regard to the design of SMT? How does one determine if
their workload is a low-IPC workload?

~~~
rigtorp
HT siblings share most execution units in the core. If your workload stalls a
lot due to branch mispredictions or memory accesses (low IPC), these units can
be shared effectively. The Linux perf tool can be used to check IPC.

------
annoyingnoob
I would never seriously consider overclocking in a production environment.

There is probably buffer tuning you can do in the NIC driver also.

~~~
rigtorp
Many organisations run water cooled overclocked servers in production. I have
not yet heard of any production use of sub-ambient cooling, but that would be
awesome!

~~~
ncmncm
Liquid nitrogen cooling is a thing, where maximum per-thread performance is
needed, but it provides at best 2x. There are a lot of other things to do
first.

~~~
rigtorp
Yes and also subambient using heat pumps, but I have never seen it deployed in
a data center. How would you deal with condensation?

~~~
angry_octet
With a small heater to blow hot air over it.

~~~
rigtorp
That would work with a heat pump setup. Liquid nitrogen cooling is usually
done by evaporating it into the air directly on the processor package (as far
as I know), so you would always have condensation inside the server case. Hmm,
I guess an isolated duct for the evaporated nitrogen to escape, with the
outside of that duct heated to ambient temp.

------
letientai299
> Disable mitigations for CPU vulnerabilities

This raises a question: for which types of applications is it OK to disable
the mitigations?

~~~
steventhedev
For the majority of applications that follow this guide (or need to), the OS
mitigations don't matter anyways:

1\. They're running only trusted code.

2\. L1 cache attacks aren't relevant if there's only one thread ever running
on a given core.

3\. Kernel-bypass networking means there are no system calls in the hot path
anyways, so the OS mitigations won't even run in the first place.

If you're already doing all this, it may be easier/better to look at using
FPGAs instead. The advantage of the software approach, though, is that you
don't need to procure a card with enough LUTs to house your design, and it
allows the ops team to contribute to performance.

~~~
toast0
If you've reduced the number of syscalls, the mitigations almost don't matter.
Turning off the mitigations is more important if you're using the kernel stack
for something like a tuned HAProxy installation which is mostly syscalls.
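
The current mitigation state is visible in sysfs, and `mitigations=off` on the kernel command line disables the lot (only defensible on machines running exclusively trusted code):

```shell
# One file per known vulnerability, stating the active mitigation.
grep . /sys/devices/system/cpu/vulnerabilities/*

# Verify the boot parameter took effect (look for mitigations=off).
cat /proc/cmdline
```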

------
fulafel
Can someone ELI5 why you'd want to do line rate packet capture in a low-
latency way? Wouldn't you risk losing packets because you are trading off
processing capacity for latency?

~~~
angry_octet
Measuring network delay, or just doing kernel-bypass processing. Generally you
know you won't lose packets because the rate is much lower than you can
process.

~~~
fulafel
Btw, for timing, a lot of NICs and their drivers (e.g. Intel) support hardware
timestamps.

------
en4bz
NO_HZ_FULL only works on Fedora and RHEL/CentOS 7+ unfortunately. Not sure why
Debian derivatives haven't enabled this feature in the kernel.
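
Whether a given distribution kernel has the feature is easy to check; enabling it still requires boot parameters (the 2-7 core range is a placeholder):

```shell
# CONFIG_NO_HZ_FULL=y means the kernel supports full tickless mode.
grep NO_HZ_FULL /boot/config-"$(uname -r)"

# Then boot with e.g.: nohz_full=2-7 rcu_nocbs=2-7
```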

~~~
ncmncm
If you are serious, you build your own kernel, and it doesn't matter what
Debian or Red Hat does or doesn't.

But not everybody is so serious, and ease of deployment often matters. CentOS
7 is very old now. I hope you are not still using 6.

~~~
froh
The "+" was meant to indicate "or later": CentOS 7 or later.

And not building your own is less about ease of deployment (you could rebuild
packages and add your own repo).

Building your own disconnects validation and maintenance. So maybe you don't
build your own _because_ you are serious.

~~~
ncmncm
It doesn't take much courage to field 7. CentOS 8 was released several months
ago.

Persuading customers to run your special kernel is harder than getting them to
run your program, so stuff that works with a stock kernel is an easier sell.
Getting them to insmod your custom module is often possible when deploying a
custom kernel isn't.

------
z3t4
What Linux distro are you using? Couldn't find any of the util tools in
Ubuntu's packages.

~~~
Izkata
(Ubuntu 18.04) tuned-adm is in tuned, cpupower and perf are in linux-tools-
common, irqbalance was already installed for me, and I also don't see tuna.

------
29athrowaway
Surprised to not see swappiness as a part of this guide.

~~~
angry_octet
Does anyone use swap on a system with real time response goals? Just disable
swap.
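
For completeness, a sketch of disabling swap outright rather than tuning swappiness (requires root; remember to also remove swap entries from /etc/fstab to make it permanent):

```shell
# Disable all swap devices immediately.
swapoff -a

# If swap must stay enabled, at least discourage its use.
sysctl -w vm.swappiness=1
```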

~~~
29athrowaway
Agreed, but it is worth mentioning.

------
hathym
My two additions:

* Unikernels

* Intel Clear Linux OS

