Hacker News
Low latency tuning guide (rigtorp.se)
294 points by ingve 9 months ago | 98 comments



I have a hobby project that followed a similar learning path. The one thing I would recommend, if you are also working on your own server, is not to forget the software side: perf (http://brendangregg.com/perf.html) is a godsend, not just for the kernel side but for your own software as well. As part of my build I always ran the command below:

'perf stat -e task-clock,cycles,instructions,cache-references,cache-misses,branches,branch-misses,faults,minor-faults,cs,migrations -r 3 nice taskset 0x01 ./myApplication -j XXX '

Additions I would have I have benefited:

* I use the latest kernel, trimmed down, with no encryption, no extra devices, etc.

* You might want to check out RT kernels; finance and trading people are always a good guide, if you can find them

* I removed the entire boilerplate app stack from Linux, or built a small one; I am even considering getting rid of the network stack for my personal use

* Disable hyper-threading: I had short-lived workers and this didn't help in my case, so you might want to validate first which setting best suits your needs

* Also check your CPU capabilities (e.g. with AVX2 and quad-channel memory I get great improvements) and test them to make sure

* A system like this quickly gets hot, so watch the temps. Short-running tests might give you happy results, but over the long term temps easily hit the wall, and the BIOS will not give a fuck, it will just throttle


I get that it was a hobby project so you could just be doing these optimizations for the heck of it. But if you do have measurements of how much each of these factors contributed (especially the two points about custom kernels), it would be useful.


Please don’t take offense at this, my friend; it is genuinely meant as constructive criticism. Slow down a little bit and re-read what you’re typing. I can’t understand half of what you’ve written here because it is so poorly done. It is a shame, because I feel like you’re trying to share interesting information; it’s just extremely hard to parse whatever it is you’re trying to say.


Interesting. I found the comment rather succinct and to the point. Maybe you are not the target audience.


That might be true, maybe you could help me understand what this means:

“Additions I would have I have benefited“


I am afraid that this looks like nitpicking. Even if you can't work out the original intent from that sentence alone, the contents of the bullet points make it clear what information they represent. So even if that sentence were dropped entirely, I wouldn't have had any difficulty understanding what the comment is about, or its useful information content.


Other tricks I have benefited from:


>I can’t understand half of what you’ve written here because it is so poorly done. It is a shame, because I feel like you’re trying to share interesting information; it’s just extremely hard to parse whatever it is you’re trying to say.

While their writing is not perfect English, I had no problem understanding everything they wrote.


I’m really having trouble parsing this bit, could you explain for me?

“Additions I would have I have benefited“


I thought it was "Additions I would have benefited from" but then reading them it is clearly "Additions I have benefited from".


I know it's not the exact same kind of concern as presented here, but I have recently found that one technique for achieving extremely precise timing of execution is to just sacrifice an entire high priority thread to a busy wait loop that checks timing conditions as fast as the CPU will cycle instructions. This has the most obvious advantage of being trivial to implement, even in high level languages that only expose the most basic of threading primitives.

Thinking strategically about this approach, modern server CPUs expose upwards of 64/128 threads, so if 1 of these had to be sacrificed completely to the gods of time, you are only looking at 1-2% of your overall resources spent for this objective. Then, you could reuse this timing service for sequencing work against the other 98-99% of resources. Going back just a few years, throwing away 12/25/50% of your compute resources for the sake of precise timing would have been a non-starter.

For reference, I find that this achieves timing errors measured in the 100-1000 nanoseconds range in my .NET Core projects when checking a trivial # of events. I have not bothered to optimize for large # of events yet, but I believe this will just be a pre-processing concern with an ordered queue of future events. I have found that this is precise enough timing to avoid the sort of logic you would otherwise need to use to calculate for time error and compensate on future iterations (e.g. in non-deterministic physics simulations or frame timing loops).
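
To make the idea concrete, here is a minimal sketch of the busy-wait check in Python (not the .NET code described above). The function names and the 200 microsecond delay are made up for illustration, and the overshoot you measure will depend heavily on the machine, the runtime, and whether the thread is pinned.

```python
import time


def spin_until(deadline_ns):
    """Busy-wait until the clock reaches deadline_ns; return the overshoot in ns."""
    while True:
        now = time.perf_counter_ns()
        if now >= deadline_ns:
            return now - deadline_ns


def measure_overshoot(delay_ns=200_000, rounds=50):
    """Worst overshoot seen over a number of spin-waits (hypothetical helper)."""
    worst = 0
    for _ in range(rounds):
        deadline = time.perf_counter_ns() + delay_ns
        worst = max(worst, spin_until(deadline))
    return worst


if __name__ == "__main__":
    print("worst overshoot (ns):", measure_overshoot())
```

Comparing this number against the same loop built on a sleep call is a quick way to see what the sacrificed thread buys you.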


Yes, definitely turn off HT/SMT and use a single app thread per core with busy waiting. I'm working on a low latency application design guide exploring this more in depth.


I haven't measured this yet, but I question whether SMT would actually introduce any meaningful jitter into the timing loop. If my event is off by 10-100 nanoseconds, I probably don't care that much.

I am not actually sure how precise this approach could be in theory, so if the noise floor could be low enough for it to matter, then it's certainly a factor for some applications.

If we prove that SMT has a certain bounded impact, then it may be possible to say that for a certain range of applications you get a 2x feasibility bump because you can leave SMT enabled.


It shouldn't, that's the whole reason SMT exists. If there is detectable jitter that would be notable.

People have a bad taste in their mouth that was left circa ~2000(?) from some Intel parts with a pipeline that was too deep. Ever since that was fixed most workloads do see a 2x speedup when enabling SMT.


SMT sibling threads can definitely impact each other. It works great for common workloads. If you have a highly tuned workload with high IPC or want to trade off throughput for latency, disabling SMT can be a win. Disabling SMT also increases effective L1 and L2 cache which can be beneficial.


With busy polling you basically halve the SMT sibling thread's memory bandwidth. But yeah it might work well for a specific use case anyway.


What kind of tasks would said thread be concerned with? Delegation and I/O?


I am currently using it to drive timing of frame generation and processing of UI events (i.e. animations, cursor flashing, etc) in a custom 2d graphics engine.

The API I currently have is:

int RegisterTimer(int afterMicroseconds, Action action)

void CancelTimer(int timerId)

It is really nice having this level of timing resolution and consistency in such a simple interface. I can just assume that whatever action I set up for delayed execution is running precisely when I wanted it to (in practical terms).
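
For readers wondering how an interface like this might look underneath, here is a rough Python sketch, assuming a min-heap of due times that a dedicated thread polls in a tight loop. It is not the actual .NET implementation; the class name, the injectable clock, and the poll/spin split are all assumptions, and calls from other threads are not synchronized here.

```python
import heapq
import itertools
import time


class SpinTimer:
    """Hypothetical timer service meant to be polled by one busy-waiting thread."""

    def __init__(self, clock=time.perf_counter_ns):
        self._clock = clock
        self._heap = []                  # entries are (due_ns, timer_id, action)
        self._cancelled = set()
        self._ids = itertools.count(1)

    def register_timer(self, after_microseconds, action):
        timer_id = next(self._ids)
        due = self._clock() + after_microseconds * 1_000
        heapq.heappush(self._heap, (due, timer_id, action))
        return timer_id

    def cancel_timer(self, timer_id):
        # Lazy cancellation: the entry is dropped when it reaches the heap top.
        self._cancelled.add(timer_id)

    def poll(self):
        """Run every action that is due now; return how many actions ran."""
        ran = 0
        now = self._clock()
        while self._heap and self._heap[0][0] <= now:
            _, timer_id, action = heapq.heappop(self._heap)
            if timer_id in self._cancelled:
                self._cancelled.discard(timer_id)
                continue
            action()
            ran += 1
        return ran

    def spin(self, should_stop):
        """Body of the sacrificed thread: poll as fast as possible."""
        while not should_stop():
            self.poll()
```

The clock is injectable only so the logic can be exercised without real time passing; the ordered heap is the "pre-processing" step that keeps each poll cheap when many events are pending.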


If I understand that right you have a thread that only looks for now-open jobs and assigns them to workers? How do they receive their work?

Funny enough, what you're describing is basically the timer api that was used in warcraft 3 scripting.


Or the thread is doing the work directly?


In some cases the thread will, in others it will enqueue the event in an LMAX Disruptor for execution on one of the other available threads.


Keep in mind that by spinning, you're preventing the CPU from sleeping thus wasting a lot of energy.

At the very least, make sure you stop spinning when the game loses focus.


For reference, the domain of usage of this timer thread is in a server-side application. Clients do not have to run this. The server application handles many clients simultaneously, so cost of spinning is amortized across many users.


On application side I recommend using an instrumenting profiler that will let you know down to sub-microseconds what the code is doing. Tracy is a good choice (https://github.com/wolfpld/tracy) but there are others, e.g. Telemetry.


So, I just spent 2 hours checking this (tracy) out and I must say I am impressed. Here's a good video [1] from two years ago that shows its capabilities, and it's had 5 releases since then (check his YouTube channel for more recent vids of new features added). [1] https://www.youtube.com/watch?v=fB5B46lbapc

ps I once had a different username, but don't login often and forgot my pwd (with no email addr on file) :( I've been on HN for years. Not affiliated in any way with the tracy project or its author.


I can't tell what Tracy does because the documentation is so poor. Check out XRay for an older but still actively developed function tracing tool that generates traces which can be viewed in the Chrome trace viewer.

https://llvm.org/docs/XRayExample.html#debugging-with-xray


Did you see the pdf manual? I would actually have preferred something online for docs, but there is not a lack of documentation for Tracy.


This page [1] describes additional tuning parameters. In particular adjusting /proc/sys/kernel/sched_rt_runtime_us can be beneficial.

[1] https://access.redhat.com/documentation/en-us/red_hat_enterp...
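
As a small illustration of what that knob controls: sched_rt_runtime_us and sched_rt_period_us together cap how much of each period realtime tasks may consume (the kernel defaults are 950000 and 1000000, and -1 disables the cap). The helper names below are made up for this sketch.

```python
from pathlib import Path


def rt_cpu_share(runtime_us, period_us):
    """Fraction of each period that realtime tasks are allowed to run."""
    if runtime_us < 0:          # -1 means realtime throttling is disabled
        return 1.0
    return runtime_us / period_us


def read_rt_share(proc_dir="/proc/sys/kernel"):
    """Read the current setting; returns None if the files are unavailable."""
    base = Path(proc_dir)
    try:
        runtime = int((base / "sched_rt_runtime_us").read_text())
        period = int((base / "sched_rt_period_us").read_text())
    except (OSError, ValueError):
        return None
    return rt_cpu_share(runtime, period)
```

With the defaults this gives 0.95, i.e. the remaining 5% of each second is reserved for non-realtime tasks.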


For lowest latency applications I would avoid using RT priorities. Better to run each core at 100% with busy waiting; if you do so with RT prio you can prevent the kernel from running tasks such as vmstat, leading to lockup issues. Out of the box there is currently no way to 100% isolate cores in Linux. There is some ongoing work on that: https://lwn.net/Articles/816298/


One that seems rather important but missing is NIC receive coalesce. The feature delays frames to reduce the number of interrupts, thus increasing latency. Usually you want to turn this down as far as possible, but don’t set it to “1” because many NICs interpret that setting to mean “use black-box adaptive algorithm” and you don’t want that either.

It’s also quite helpful to run your Rx soft interrupts on the core that’s receiving the packet, but flow steering isn’t mentioned.


For truly lowest latency in a software application you need to avoid all context switches. Using interrupt driven IO adds too much overhead. You need to use polling and busy waiting. I'm working on a guide for this type of application design. If you are indeed using the Linux network stack, then yes, adjusting NIC interrupt coalescing and interrupt affinity is useful.


True, but you have to rewrite your application to switch to a receive queue polling model, whereas "tuning" is stuff you can get for free with your existing program.


I've always been using kernel bypass for low latency networking. You can also use SO_BUSY_POLL with the Linux stack. I should at least mention this in the guide.
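
For anyone who wants to try SO_BUSY_POLL, a minimal Linux-only sketch follows. Python's socket module may not export the constant, so the fallback value 46 (from asm-generic/socket.h) is an assumption that will be wrong on some architectures; the 50 microsecond budget is an arbitrary example, and raising the value can require CAP_NET_ADMIN.

```python
import socket

# Assumption: 46 is SO_BUSY_POLL on Linux for x86 and other asm-generic arches.
SO_BUSY_POLL = getattr(socket, "SO_BUSY_POLL", 46)


def enable_busy_poll(sock, microseconds=50):
    """Ask the kernel to busy-poll the device queue on blocking receives.

    Returns True if the option was accepted, False otherwise (for example
    when the caller lacks the privilege to raise the value).
    """
    try:
        sock.setsockopt(socket.SOL_SOCKET, SO_BUSY_POLL, microseconds)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        print("busy poll enabled:", enable_busy_poll(s))
```

The NIC driver also has to support busy polling for this to have any effect.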


This is a cool article in the sense that it gives an idea of tuning that can be done on one extreme. While most applications won't need this level of tuning and some of them might be hurt by it if they aren't CPU bound, it is great to know which options exist.

Does anyone have further articles exploring the topic of os tuning for various types of applications? Maybe also for other OS, BSD/Win?


I have another article on virtual memory: https://rigtorp.se/virtual-memory/


You know you're doing low level stuff when the guide mentions things like

> Also consider using older CPU microcode without the microcode mitigations for CPU vulnerabilities.

I don't think I even know where to find older microcode for my particular CPU.


Using older microcode is just a matter of preventing your OS from uploading newer microcode during the boot process, and not updating the motherboard firmware to a newer version that bundles newer CPU microcode. Rolling back the motherboard firmware is usually not a supported option, and sometimes is actively prohibited by the system.
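
To see which microcode revision a box is actually running, on x86 Linux the "microcode" field in /proc/cpuinfo reports it per logical CPU. A small sketch (the function name is made up) that collects the distinct revisions:

```python
def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions listed in /proc/cpuinfo text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            revisions.add(value.strip())
    return revisions


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(microcode_revisions(f.read()))
    except OSError:
        print("no /proc/cpuinfo on this system")
```

More than one value in the set would mean the cores are not all on the same revision.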


So how would I acquire a processor without the mitigations at this point? Presumably all the new ones are sold with them already installed.


CPUs do not have non-volatile memory for storing microcode updates. It has to be uploaded from the motherboard during the boot process, and the OS can optionally upload a newer version as part of their boot process. So the difficult thing is finding a matching motherboard that's running an old enough firmware version, or that can be rolled back to an older firmware version.


Socketed BIOS chips used to be somewhat common in gaming boards, which would always allow you to downgrade the firmware. I don't think they're found in workstation or server boards.


If the BIOS EEPROM chip is not a socketed DIP-8 package, often it uses a surface-mount SOIC-8 chip, which can be easily programmed with a SOIC-8 test clip and an I2C/SPI programmer. In case you don't have one, you can also use any single-board computer, like a Raspberry Pi and flashrom for the job.


> Don’t create any file backed writable memory mappings

If you have to create a writable file-backed memory mapping, open it in /dev/shm or /dev/hugepages. You can mount your own hugetlbfs volume with the right permissions.

Creating a big-ass single-writer ring buffer in a series of mapped hugetlb pages is a great way to give downstream code some breathing room. You can have multiple downstream programs, each on its own core, map it read-only, and start and stop them independently. Maintain a map of checkpoints farther back in the ring buffer, and they can pick up again, somewhere back, without missing anything.
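
A toy version of that layout, as a sketch only: one writer maps a file read-write, readers map it read-only and resume from a saved checkpoint. The slot size, slot count, and function names are invented for illustration, and the file would live under /dev/shm or a hugetlbfs mount in practice. Python gives no memory-ordering guarantees, so a real implementation in C or C++ would need proper barriers and a check for being lapped mid-read.

```python
import mmap
import struct

HEADER = struct.Struct("<Q")    # count of records published so far
SLOT_SIZE = 64                  # fixed-size slots keep the index math trivial
SLOTS = 8                       # tiny on purpose, for demonstration


def create_ring(path):
    """Writer side: create the backing file and map it read-write."""
    size = HEADER.size + SLOT_SIZE * SLOTS
    with open(path, "wb") as f:
        f.truncate(size)
    with open(path, "r+b") as f:
        return mmap.mmap(f.fileno(), size)


def open_ring_readonly(path):
    """Reader side: map the same file read-only."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def publish(ring, payload):
    """Single writer: fill the next slot, then advance the published count."""
    if len(payload) > SLOT_SIZE:
        raise ValueError("payload larger than a slot")
    (seq,) = HEADER.unpack_from(ring, 0)
    offset = HEADER.size + (seq % SLOTS) * SLOT_SIZE
    ring[offset:offset + SLOT_SIZE] = payload.ljust(SLOT_SIZE, b"\0")
    HEADER.pack_into(ring, 0, seq + 1)


def read_from(ring, checkpoint):
    """Return (records, new_checkpoint), skipping records already overwritten."""
    (head,) = HEADER.unpack_from(ring, 0)
    cursor = max(checkpoint, head - SLOTS, 0)
    records = []
    while cursor < head:
        offset = HEADER.size + (cursor % SLOTS) * SLOT_SIZE
        # Trailing NUL padding is stripped, so payloads should not end in NUL.
        records.append(ring[offset:offset + SLOT_SIZE].rstrip(b"\0"))
        cursor += 1
    return records, cursor
```

A reader that stops and restarts only needs to remember its checkpoint; if the writer has lapped it in the meantime, it picks up at the oldest record still in the ring.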


I don't consider that file backed, since they pull memory from the same pool as anonymous memory and not the page cache. The Linux kernel docs make the distinction between file backed and anonymous memory. I think a better term would probably be page-cache-backed memory vs anonymous memory.


You might be interested in https://rigtorp.se/virtual-memory/ where I look deeper at the VM subsystem.


Would love to see a version of this but for ARM64 .. I'm assuming a few of these tips will be applicable, but I bet there's some ARM-specific things to be learned.


Most of these tips apply to all architectures. The only x86 specific parts are regarding CPU power management and turboboost.


My thought exactly. Would be great to read an Arm-specific version of this article.


This is a good list, but it seems to blur latency and jitter. For example, turbo modes can cause significant variability, threads running on other cores can cause your core to downclock, etc.

Reducing kernel scheduler interrupt rate can cause strange delay effects (presumably due to locking). Running it faster, but only on some cores, has been more beneficial IME. Depends on the latency vs jitter requirement of your computation I guess. If you're using SCHED_FIFO there is a complicated interaction with ticking being sometimes enabled (while theoretically tickless) at 1kHz to let kernel threads run...

Multithreaded apps should consider cache contention and CPU/memory placement. This does not always mean placing all threads on the same socket, because you might need to get max memory bandwidth. Cf lstopo, numad/numactl, set_mempolicy(2). Making sure no other process can thrash the L3 or use memory bandwidth on your real time socket can also help. Note that numad(8) does page migrations, so it can cause jitter when that happens, but also reduce jitter for steady state.


With the right cooling setup I've been able to get Xeons to run permanently in turbo mode, kind of a back door overclock. You would have to experiment.

For lowest latency applications I would avoid using RT priorities. Better to run each core at 100% with busy waiting; if you do so with RT prio you can prevent the kernel from running tasks such as vmstat, leading to lockup issues. Out of the box there is currently no way to 100% isolate cores in Linux. There is some ongoing work on that: https://lwn.net/Articles/816298/

Oh yeah, the whole NUMA page migration stuff is not well documented. You'll also have proactive compaction to deal with in the future: https://nitingupta.dev/post/proactive-compaction/


If you can make sure something else won't emit watts (other cores, something using AVX) then yeah turbo can work.

You can patch the kernel thread creation to not use specific CPUs... IDK of anyone publicly maintaining a patchset, but look at https://github.com/torvalds/linux/blob/master/kernel/kthread... line 386.

We've only been using SCHED_FIFO on Linux, which is basically busywaiting if you only have one thread scheduled. Though I am interested in trying the SCHED_DEADLINE.

I hope someone adds a kcompact processor mask / process state disabling compaction.


There was also this recent patch https://lwn.net/Articles/816211/ to deal with kthread affinities. Even with isolcpus I find I still need to run pgrep -P 2 | xargs -i taskset -p -c 0 {} and deal with the workqueues.
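
For what it's worth, the pgrep/taskset one-liner above can be approximated in Python roughly as follows. This is only a sketch with made-up function names: it finds processes whose parent is PID 2 (kthreadd) and tries to move them to one CPU. It needs root to do anything useful, and per-CPU kernel threads will refuse the change, which is silently skipped here.

```python
import os


def parse_ppid(stat_line):
    """Extract the parent PID from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the last ')' rather than on whitespace.
    rest = stat_line[stat_line.rindex(")") + 2:].split()
    return int(rest[1])          # rest[0] is the state, rest[1] the ppid


def children_of(ppid=2, proc="/proc"):
    """PIDs whose parent is ppid (2 is kthreadd on Linux)."""
    pids = []
    try:
        entries = os.listdir(proc)
    except OSError:
        return pids
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                if parse_ppid(f.read()) == ppid:
                    pids.append(int(entry))
        except (OSError, ValueError, IndexError):
            continue             # the process exited while we were scanning
    return pids


def move_to_cpu(pids, cpu=0):
    """Try to pin each PID to one CPU; return the PIDs that accepted it."""
    moved = []
    for pid in pids:
        try:
            os.sched_setaffinity(pid, {cpu})
            moved.append(pid)
        except OSError:
            pass                 # per-CPU kthreads and unprivileged callers
    return moved
```

Calling move_to_cpu(children_of(2), 0) is the rough equivalent of the shell pipeline; the workqueues still need separate handling.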

Have you tried "A full task-isolation mode for the kernel": https://lwn.net/Articles/816298/ ?

Running only a single thread per core, I see no difference between SCHED_FIFO vs SCHED_OTHER. Except SCHED_FIFO can cause lockups if running 100% since cores are not completely isolated (ie vmstat timer and some other stuff).

Yes, it's annoying you cannot disable compaction. There is also work on pro-active compaction now: https://nitingupta.dev/post/proactive-compaction/


I like the Tosatti/WindRiver/Lameter patch. (Except the naming: _possible is the same meaning as _available, but they mean different things here depending if kthreads or user threads?) Just needs a proc interface.

With a patch like this you can force bottom half (interrupts), top half (kthreads) and user threads to all be on different cores.

The 'full task-isolation mode' seems weird, because why should you drop out of isolation because of something outside your control like paging or TLB? Anyway, mlockall that. It's fine to be told I guess (except signals take time) but why drop out of isolation and risk glitches in re-isolating? It doesn't seem very polished.

Something else occurred to me: you still have to be careful about data flow and having enough allocatable memory. E.g., a lot of memory local to core 0 will be consumed by buffer cache, so it can be beneficial to drop it (free; numastat -ms; sync; echo 1 > /proc/sys/vm/drop_caches; free; numastat -ms).

I find it bizarre that all this THP compaction stuff is for workloads that are commonly run under a virtual machine, i.e. another layer of indirection.


As I understand 'full task-isolation mode' will prevent compaction, completely disable vmstat timer etc. So it provides additional isolation. Since you already switched into kernel mode, might as well deliver a signal to let you know it happened. If the signal is masked there should be no overhead at all except a branch to check the signal mask.


> The term latency in this context refers to ... The time between a request was submitted to a queue and the worker thread finished processing the request.

Since this is a tuning guide, I would have liked to see a separation of 3 attributes:

* Service-time: The actual time taken by the thread/process once it begins processing a request.

* Latency: The time spent by the request waiting to get processed (ex: languishing in a queue on the client/server side). This is when the request was latent.

* Response time: A combination of Service-time + Latency as recorded by the server. From a client's POV, this would additionally include the overhead from the network media etc.

Most performance models seem to isolate these separately to get a deeper sense of where the bottlenecks are. When there's just a single queue for everything then it makes sense to make the service-time as short as possible. But if you have multiple workload-based queues then you can do more interesting things.
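
In code, the three attributes fall out of three timestamps per request. A minimal sketch, with hypothetical field names, assuming the server records when a request is enqueued, when a worker picks it up, and when the worker finishes:

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    enqueued_ns: int    # request placed on the queue
    started_ns: int     # worker thread picked it up
    finished_ns: int    # worker thread finished processing

    @property
    def latency_ns(self):
        """Time the request spent waiting before processing began."""
        return self.started_ns - self.enqueued_ns

    @property
    def service_time_ns(self):
        """Time the worker actually spent on the request."""
        return self.finished_ns - self.started_ns

    @property
    def response_time_ns(self):
        """Latency plus service time, as seen by the server."""
        return self.latency_ns + self.service_time_ns
```

For example, RequestTiming(1000, 4000, 9000) has 3000 ns of latency, 5000 ns of service time, and an 8000 ns response time. Keeping the two components separate shows whether to tune the worker or the queueing.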


This guide pretty much tells you how to make the Linux kernel interfere as little as possible with your application. How to instrument and what to measure would depend on the application.

I agree that measuring queuing delay and processing delay separately makes sense.


This is a great point. For the purposes of queuing theory analysis, some separate out latency from response time in which case response time is just service time + queue time, and latency is transit time before arriving at the queue.


threadirqs cmdline option might also make a difference:

threadirqs Force threading of all interrupt handlers except those marked explicitly IRQF_NO_THREAD.

It helped with my bluetooth issues and it's recommended for low-latency audio setups, but unfortunately I lack the knowledge about the tradeoffs. You also probably need to assign a higher priority to threads: https://alsa.opensrc.org/Rtirq - not sure if it's applicable besides audio.


Hint for web designers: subtle drop-shadow on text is a great way to simulate astigmatism. Thankfully, Firefox's Reader View simulates eyeglasses.


TIL I have astigmatism from an HN comment. Now the headlights + traffic lights thing at night makes sense too...


Thanks, I was wondering why reading that link was like stabbing myself in the eyeballs :) Reader View is a great tip!



Another technique to dynamically manipulate task isolation is through the kernel's 'cpuset' interface (https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1...). Coupled with threadirqs and RCU 'kthread-ification' via rcu_nocbs, it can yield a large amount of flexibility in organizing real-time workloads. The userspace interface to cpusets via 'cpuset' makes it even more accessible (https://documentation.suse.com/sle-rt/15-SP1/html/SLE-RT-all...)


This is an excellent article. It is comprehensive and has simple explanations.


Great info.

You might also want to look into or write about DPDK, which achieves further speed-ups using polling mode drivers (instead of interrupt) and having the application directly process packets from the NIC (bypassing the kernel, which can be a bottleneck).

https://en.wikipedia.org/wiki/Data_Plane_Development_Kit


The author states:

>"Hyper-threading (HT) or Simultaneous multithreading (SMT) is a technology to maximize processor resource usage for workloads with low instructions per cycle (IPC)"

I had actually not heard this before regarding SMT. What is the significance of low IPC type in regards to the design of SMT? How does one determine if their workload is a low IPC workload?


HT siblings share most execution units in the core. If your workload stalls a lot due to branch misprediction or memory access (low IPC), these units can be shared effectively. The Linux perf tool can be used to check IPC.
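
As a rough sketch of the arithmetic: IPC is just instructions divided by cycles. The parser below assumes the machine-readable output of perf stat -x, with the counter value in the first field and the event name in the third (no -I or per-CPU options), and the sample numbers in the usage note are invented.

```python
def parse_perf_csv(text):
    """Map event name -> integer count from 'perf stat -x,' style output."""
    counts = {}
    for line in text.splitlines():
        fields = line.split(",")
        if len(fields) < 3:
            continue
        try:
            value = int(fields[0])
        except ValueError:
            continue    # "<not counted>", "<not supported>", non-integer values
        # Drop modifiers such as ":u" so "cycles:u" is stored as "cycles".
        counts[fields[2].split(":")[0]] = value
    return counts


def ipc(counts):
    """Instructions retired per CPU cycle."""
    return counts["instructions"] / counts["cycles"]
```

Feeding it two lines such as "1000,,cycles,..." and "2500,,instructions,..." gives an IPC of 2.5. What counts as "low" depends on the CPU, so it is best compared against a known-good run of the same workload.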


I would never seriously consider overclocking in a production environment.

There is probably buffer tuning you can do in the NIC driver also.


Many organisations run water cooled overclocked servers in production. I have not yet heard of any production use of sub-ambient cooling, but that would be awesome!


Liquid nitrogen cooling is a thing, where maximum per-thread performance is needed, but it provides at best 2x. There are a lot of other things to do first.


Yes and also subambient using heat pumps, but I have never seen it deployed in a data center. How would you deal with condensation?


Control the humidity, bring the temperature up if humidity control fails.


With a small heater to blow hot air over it.


That would work with a heat pump setup. Liquid nitrogen cooling is usually done by evaporating into the air directly on the processor package (as far as I know). So you would always have condensation inside the server case. Hmm, I guess you could use an isolated duct for the evaporated nitrogen to escape and heat the outside of that duct to ambient temp.


have everything that's below ambient be submerged in an engineered fluid?


Like who?


> Disable mitigations for CPU vulnerabilities

This raises a question: for which types of applications is it OK to disable mitigations?


Applications for which you can be sure that no other applications are running on the same box, either:

1. Because it's completely airgapped from the bigger internet and you control everything on it. Think complex embedded systems like radar HW control on military ships. The ships I was sailing on had 6 full racks per radar just for things like track maintenance and missile up/downlink scheduling. At some complexity level it becomes worth it to "lift" apps out of the embedded domain and make use of the facilities a bigger OS like Linux provides, but you often still have fairly tight realtime and performance requirements. A HFT server only connected to an exchange server could also count.

2. You have adequate security measures in other parts of your setup that, after carefully evaluating the risks, you decide to forego defense-in-depth in this part of the system.

There are not all that many fields where this type of microsecond chasing is all that worthwhile though. There are significant costs and risks involved and most web users won't ever notice a page load increase of a few microseconds. There are way more cost-effective performance improvements available for 99+% of companies out there than CPU pinning and IRQ isolation.


For the majority of applications that follow this guide (or need to), the OS mitigations don't matter anyways:

1. They're running only trusted code.

2. L1 cache attacks aren't relevant if there's only one thread ever running on a given core.

3. Kernel-bypass networking means there are no system calls in the hot path anyways, so the OS mitigations won't even run in the first place.

If you're already doing all this it may be easier/better to look at using FPGAs instead. The advantage of this approach is that you don't need to procure a card with enough LUTs to house your design, and it allows the Ops team to contribute to performance.


If you've reduced the number of syscalls, the mitigations almost don't matter. Turning off the mitigations is more important if you're using the kernel stack for something like a tuned HAProxy installation which is mostly syscalls.
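
Before deciding either way, it helps to see what the kernel thinks is currently mitigated. On Linux the files under /sys/devices/system/cpu/vulnerabilities each hold a one-line status ("Not affected", "Vulnerable...", or "Mitigation: ..."). A small sketch with made-up function names:

```python
from pathlib import Path


def mitigation_status(root="/sys/devices/system/cpu/vulnerabilities"):
    """Map vulnerability name -> status line; empty if sysfs is unavailable."""
    status = {}
    base = Path(root)
    if not base.is_dir():
        return status
    for entry in sorted(base.iterdir()):
        try:
            status[entry.name] = entry.read_text().strip()
        except OSError:
            continue
    return status


def still_vulnerable(status):
    """Names whose status line starts with 'Vulnerable'."""
    return [name for name, text in status.items() if text.startswith("Vulnerable")]


if __name__ == "__main__":
    for name, text in mitigation_status().items():
        print(f"{name}: {text}")
```

Running it before and after changing boot parameters shows which mitigations were actually switched off.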


if you are in control of all the code you are running, are using desktop software that does not connect to the internet...


Rendering or benchmarking on an isolated network.


application on private network of private corporation ...


Can someone ELI5 why you'd want to do line rate packet capture in a low-latency way? Wouldn't you risk losing packets because you are trading off processing capacity for latency?


Measuring network delay, or just doing kernel bypass processing. Generally you know you won't lose packets because the rate is much lower than you can process.


Btw, for timing, a lot of NICs and their drivers (e.g. Intel) support hardware timestamps.


NO_HZ_FULL only works on Fedora and RHEL/CentOS 7+ unfortunately. Not sure why Debian derivatives haven't enabled this feature in the kernel.
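
One way to check a given kernel, as far as I know: when the feature is compiled in, /sys/devices/system/cpu/nohz_full lists the CPUs running in that mode, in the usual "2-5,8" list format. A sketch with made-up function names; treating a missing file as "not available" is an assumption.

```python
def parse_cpu_list(text):
    """Parse a kernel CPU list such as '2-5,8' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        low, sep, high = part.partition("-")
        cpus.update(range(int(low), int(high if sep else low) + 1))
    return cpus


def nohz_full_cpus(path="/sys/devices/system/cpu/nohz_full"):
    """CPUs in nohz_full mode, or None if the file is missing or unparsable."""
    try:
        with open(path) as f:
            return parse_cpu_list(f.read())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print("nohz_full CPUs:", nohz_full_cpus())
```

An empty set would mean the feature is built in but no CPUs were requested via the nohz_full= boot parameter.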


If you are serious, you build your own kernel, and it doesn't matter what Debian or Red Hat does or doesn't.

But not everybody is so serious, and ease of deployment often matters. CentOS 7 is very old now. I hope you are not still using 6.


The "+" was meant to indicate "or later": CentOS 7 or later

And not building your own is less about ease of deployment (you could rebuild packages and add your own repo).

Building your own disconnects validation and maintenance. So maybe you don't build your own because you are serious.


It doesn't take much courage to field 7. CentOS 8 was released several months ago.

Persuading customers to run your special kernel is harder than getting them to run your program, so stuff that works with a stock kernel is an easier sell. Getting them to insmod your custom module is often possible when deploying a custom kernel isn't.


According to https://youtu.be/UUOM4KdaHkY?t=1190, the older the faster.


What Linux distro are you using? Couldn't find any of the util tools in Ubuntu's packages.


(Ubuntu 18.04) tuned-adm is in tuned, cpupower and perf are in linux-tools-common, irqbalance was already installed for me, and I also don't see tuna.


RHEL, but all the tools are open source.


Surprised to not see swappiness as a part of this guide.


Does anyone use swap on a system with real time response goals? Just disable swap.


Agreed, but it is worth mentioning.


Disabling swap only prevents major page faults on anonymous memory. You want to avoid all page faults by using mlockall (https://linux.die.net/man/2/mlockall). At that point swap settings don't matter. But yeah, disable swap anyway just in case.
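
A minimal sketch of the mlockall call, here through ctypes just to keep it short; in a real low-latency application you would make the same call from C or C++ at startup. The MCL_* values are the Linux ones for x86 and most other architectures (an assumption elsewhere), and the call commonly fails for unprivileged processes because of RLIMIT_MEMLOCK.

```python
import ctypes
import ctypes.util
import os

MCL_CURRENT = 1     # lock pages that are currently mapped
MCL_FUTURE = 2      # lock pages that get mapped later as well


def _libc():
    return ctypes.CDLL(ctypes.util.find_library("c") or None, use_errno=True)


def lock_all_memory():
    """Call mlockall(MCL_CURRENT | MCL_FUTURE); return (ok, error message)."""
    if _libc().mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        return False, os.strerror(ctypes.get_errno())
    return True, ""


def unlock_all_memory():
    """Undo lock_all_memory(); returns True on success."""
    return _libc().munlockall() == 0


if __name__ == "__main__":
    print(lock_all_memory())
```

Locking alone does not pre-fault pages you have not touched yet when only MCL_FUTURE mappings are involved in odd ways, so it is still worth touching your buffers once at startup.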


My two additions:

* Unikernels

* Intel Clear Linux OS



