'perf stat -e task-clock,cycles,instructions,cache-references,cache-misses,branches,branch-misses,faults,minor-faults,cs,migrations -r 3 nice taskset 0x01 ./myApplication -j XXX '
Additions I would have I have benefited:
* I use the latest trimmed kernel, with no encryption, extra devices, etc.
* You might want to check RT kernels; the finance & trading folks are always a good guide if you can find their material
* Removed all boilerplate app stack from Linux, or built a small one; I'm even considering getting rid of the network stack for my personal use
* Disable hyper-threading: I had short-lived workers, so this didn't help in my case; you might want to validate first which setting is best suited for your needs (a sketch for toggling SMT at runtime follows this list)
* Also check your CPU capabilities (e.g. with AVX2 & quad-channel memory I get great improvements) and test them to make sure
* A system like this quickly gets hot, so watch temps; even short-running tests might give you happy results, but over long runs the temps easily hit the wall where the BIOS won't care, it will just throttle
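For the hyper-threading bullet above, a minimal sketch of toggling SMT at runtime, assuming a kernel new enough to expose /sys/devices/system/cpu/smt/control (values on/off/forceoff); run as root and benchmark your workload both ways:

    /* Sketch: flip SMT via sysfs instead of the BIOS. "off" takes sibling
     * threads offline, "on" brings them back. Requires root. */
    #include <stdio.h>
    #include <stdlib.h>

    static int set_smt(const char *mode)   /* "on", "off", or "forceoff" */
    {
        FILE *f = fopen("/sys/devices/system/cpu/smt/control", "w");
        if (!f) { perror("open smt/control"); return -1; }
        int ok = fprintf(f, "%s\n", mode) > 0;
        fclose(f);
        return ok ? 0 : -1;
    }

    int main(int argc, char **argv)
    {
        return set_smt(argc > 1 ? argv[1] : "off") == 0 ? EXIT_SUCCESS
                                                        : EXIT_FAILURE;
    }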
“Additions I would have I have benefited”
While their writing is not perfect English, I had no problem understanding everything they wrote.
Thinking strategically about this approach, modern server CPUs expose upwards of 64/128 threads, so if 1 of these had to be sacrificed completely to the gods of time, you are only looking at 1-2% of your overall resources spent for this objective. Then, you could reuse this timing service for sequencing work against the other 98-99% of resources. Going back just a few years, throwing away 12/25/50% of your compute resources for the sake of precise timing would have been a non-starter.
For reference, I find that this achieves timing errors measured in the 100-1000 nanoseconds range in my .NET Core projects when checking a trivial # of events. I have not bothered to optimize for large # of events yet, but I believe this will just be a pre-processing concern with an ordered queue of future events. I have found that this is precise enough timing to avoid the sort of logic you would otherwise need to use to calculate for time error and compensate on future iterations (e.g. in non-deterministic physics simulations or frame timing loops).
I am not actually sure how precise this approach could be in theory, so if the noise floor could be low enough for it to matter, then it's certainly a factor for some applications.
If we prove that SMT has a certain bounded impact, then it may be possible to say that for a certain range of applications you get a 2x feasibility bump because you can leave SMT enabled.
People have a bad taste in their mouth that was left circa ~2000(?) from some Intel parts with a pipeline that was too deep. Ever since that was fixed most workloads do see a 2x speedup when enabling SMT.
The API I currently have is:
int RegisterTimer(int afterMicroseconds, Action action)
void CancelTimer(int timerId)
It is really nice having this level of timing resolution and consistency in such a simple interface. I can just assume that whatever action I set up for delayed execution is running precisely when I wanted it to (in practical terms).
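The interface above is .NET; purely to illustrate the same spin-on-a-dedicated-core idea (not the poster's actual implementation), here is a rough C sketch. The names register_timer/cancel_timer, the slot table, and the choice of core are all made up for illustration:

    /* One thread pinned to a spare core spins on CLOCK_MONOTONIC and fires
     * registered callbacks when their deadline passes. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define MAX_TIMERS 64

    struct timer_slot {
        uint64_t deadline_ns;            /* 0 = slot free */
        void   (*action)(void);
    };

    static struct timer_slot timers[MAX_TIMERS];
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    /* RegisterTimer analog: fire 'action' after 'after_us' microseconds. */
    int register_timer(int after_us, void (*action)(void))
    {
        pthread_mutex_lock(&lock);
        for (int i = 0; i < MAX_TIMERS; i++) {
            if (timers[i].deadline_ns == 0) {
                timers[i].action = action;
                timers[i].deadline_ns = now_ns() + (uint64_t)after_us * 1000ull;
                pthread_mutex_unlock(&lock);
                return i;                /* timer id */
            }
        }
        pthread_mutex_unlock(&lock);
        return -1;
    }

    /* CancelTimer analog. */
    void cancel_timer(int id)
    {
        pthread_mutex_lock(&lock);
        if (id >= 0 && id < MAX_TIMERS)
            timers[id].deadline_ns = 0;
        pthread_mutex_unlock(&lock);
    }

    /* The sacrificed core: spin on the clock and run expired callbacks. */
    static void *spin_loop(void *arg)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(1, &set);                /* dedicate core 1 to timing */
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        for (;;) {
            uint64_t t = now_ns();
            pthread_mutex_lock(&lock);
            for (int i = 0; i < MAX_TIMERS; i++) {
                if (timers[i].deadline_ns && timers[i].deadline_ns <= t) {
                    timers[i].deadline_ns = 0;
                    timers[i].action();
                }
            }
            pthread_mutex_unlock(&lock);
        }
        return arg;
    }

    static void hello(void) { puts("fired"); }

    int main(void)
    {
        pthread_t th;
        pthread_create(&th, NULL, spin_loop, NULL);
        register_timer(500, hello);      /* fire after ~500 us */
        sleep(1);
        return 0;
    }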
Funny enough, what you're describing is basically the timer API that was used in Warcraft 3 scripting.
At the very least, make sure you stop spinning when the game loses focus.
ps I once had a different username, but don't login often and forgot my pwd (with no email addr on file) :( I've been on HN for years. Not affiliated in any way with the tracy project or its author.
It’s also quite helpful to run your Rx soft interrupts on the core that’s receiving the packet, but flow steering isn’t mentioned.
Does anyone have further articles exploring the topic of OS tuning for various types of applications? Maybe also for other OSes, BSD/Windows?
> Also consider using older CPU microcode without the microcode mitigations for CPU vulnerabilities.
I don't think I even know where to find older microcode for my particular CPU.
If you have to create a writable file-backed memory mapping, open it in /dev/shm or /dev/hugepages. You can mount your own hugetlbfs volume with the right permissions.
Creating a big-ass single-writer ring buffer in a series of mapped hugetlb pages is a great way to give downstream code some breathing room. You can have multiple downstream programs, each on its own core, map it read-only, and start and stop them independently. Maintain a map of checkpoints farther back in the ring buffer, and they can pick up again, somewhere back, without missing anything.
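A minimal sketch of that mapping step, assuming a hugetlbfs mount at /dev/hugepages; the file name, sizes, and ring layout are illustrative, and the checkpoint map described above is left out:

    /* Map a single-writer ring buffer backed by huge pages. Readers open the
     * same file O_RDONLY, mmap with PROT_READ, and chase the head cursor. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define HUGE_2MB   (2UL * 1024 * 1024)
    #define RING_BYTES (512UL * HUGE_2MB)   /* 1 GiB; multiple of huge page size */

    struct ring {
        uint64_t head;                       /* single writer's publish cursor */
        char     pad[4096 - sizeof(uint64_t)];
        char     data[];
    };

    int main(void)
    {
        /* Assumes enough pages reserved via /proc/sys/vm/nr_hugepages.
           write() doesn't work on hugetlbfs, but ftruncate() + mmap() do. */
        int fd = open("/dev/hugepages/ringbuf", O_CREAT | O_RDWR, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (ftruncate(fd, RING_BYTES) != 0) { perror("ftruncate"); return 1; }

        struct ring *rb = mmap(NULL, RING_BYTES, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
        if (rb == MAP_FAILED) { perror("mmap"); return 1; }

        /* Writer: append a record, then publish by advancing head with
           release semantics so readers never see unwritten data. */
        rb->data[rb->head % (RING_BYTES - sizeof(*rb))] = 42;
        __atomic_store_n(&rb->head, rb->head + 1, __ATOMIC_RELEASE);

        munmap(rb, RING_BYTES);
        close(fd);
        return 0;
    }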
Reducing kernel scheduler interrupt rate can cause strange delay effects (presumably due to locking). Running it faster, but only on some cores, has been more beneficial IME. Depends on the latency vs jitter requirement of your computation I guess. If you're using SCHED_FIFO there is a complicated interaction with ticking being sometimes enabled (while theoretically tickless) at 1kHz to let kernel threads run...
Multithreaded apps should consider cache contention and CPU/memory placement. This does not always mean placing all cores on the same socket, because you might need to get maximum memory bandwidth. Cf. lstopo, numad/numactl, set_mempolicy(2). Making sure no other process can thrash the L3 or use memory bandwidth on your real-time socket can also help. Note that numad(8) does page migrations, so it can cause jitter when that happens, but also reduce jitter for steady state.
For lowest-latency applications I avoid using RT priorities. Better to run each core at 100% with busy waiting; if you do that with RT priority you can prevent the kernel from running tasks such as vmstat, leading to lockup issues. Out of the box there is currently no way to 100% isolate cores in Linux. There is some ongoing work on that: https://lwn.net/Articles/816298/
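As a minimal sketch of that pattern (pin to one core and spin at 100% under the default SCHED_OTHER policy, so kernel housekeeping can still preempt if it must), with work_available()/do_work() as placeholders for your own hot path:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    static int  work_available(void) { return 0; }   /* placeholder */
    static void do_work(void)        { }             /* placeholder */

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);             /* core 3: ideally on isolcpus/nohz_full */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        for (;;) {                    /* spin: no sleeps, no syscalls in the loop */
            if (work_available())
                do_work();
            __builtin_ia32_pause();   /* x86 PAUSE hint; drop on non-x86 builds */
        }
    }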
Oh yeah, the whole NUMA page migration stuff is not well documented. You'll also have proactive compaction to deal with in the future: https://nitingupta.dev/post/proactive-compaction/
You can patch the kernel thread creation to not use specific CPUs... IDK of anyone publicly maintaining a patchset, but look at https://github.com/torvalds/linux/blob/master/kernel/kthread... line 386.
We've only been using SCHED_FIFO on Linux, which is basically busywaiting if you only have one thread scheduled. Though I am interested in trying the SCHED_DEADLINE.
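For reference, a minimal sketch of requesting SCHED_FIFO for the current thread (the priority value here is arbitrary; SCHED_DEADLINE is requested via the sched_setattr(2) syscall rather than sched_setscheduler). Per the caveats elsewhere in this thread, a 100%-spinning SCHED_FIFO thread can starve kernel work on its core unless the core is otherwise isolated:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 80 };   /* range 1..99 */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {  /* needs CAP_SYS_NICE/root */
            perror("sched_setscheduler");
            return 1;
        }
        /* ... pinned busy-wait loop as in the previous sketch ... */
        return 0;
    }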
I hope someone adds a kcompactd processor mask / process state disabling compaction.
Have you tried "A full task-isolation mode for the kernel": https://lwn.net/Articles/816298/ ?
Running only a single thread per core, I see no difference between SCHED_FIFO and SCHED_OTHER, except that SCHED_FIFO can cause lockups if running at 100%, since cores are not completely isolated (i.e. the vmstat timer and some other stuff).
Yes, it's annoying you cannot disable compaction. There is also work on pro-active compaction now: https://nitingupta.dev/post/proactive-compaction/
With a patch like this you can force bottom half (interrupts), top half (kthreads) and user threads to all be on different cores.
The 'full task-isolation mode' seems weird, because why should you drop out of isolation because of something outside your control like paging or the TLB? Anyway, mlockall that. It's fine to be told I guess (except signals take time), but why drop out of isolation and risk glitches in re-isolating? It doesn't seem very polished.
Something else occurred to me: you still have to be careful about data flow and having enough allocatable memory. E.g., a lot of memory local to core 0 will be consumed by buffer cache; it can be beneficial to drop it (free; numastat -ms; sync; echo 1 > /proc/sys/vm/drop_caches; free; numastat -ms).
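Riffing on the "mlockall that" point above, a minimal sketch of locking and pre-faulting memory before entering the isolated loop (the allocation size is arbitrary):

    /* Lock current and future mappings so the isolated thread never takes a
     * major fault, then touch large allocations up front. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {  /* needs CAP_IPC_LOCK or a high RLIMIT_MEMLOCK */
            perror("mlockall");
            return 1;
        }

        size_t sz = 64UL * 1024 * 1024;
        char *buf = malloc(sz);
        if (!buf) return 1;
        memset(buf, 0, sz);   /* fault everything in before the hot loop */

        /* ... latency-critical work ... */
        free(buf);
        return 0;
    }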
I find it bizarre that all this THP compaction stuff is for workloads that are commonly run under a virtual machine, i.e. another layer of indirection.
Since this is a tuning guide, I would have liked to see a separation of 3 attributes:
* Service-time: The actual time taken by the thread/process once it begins processing a request.
* Latency: The time spent by the request waiting to get processed (ex: languishing in a queue on the client/server side). This is when the request was latent.
* Response time: A combination of Service-time + Latency as recorded by the server. From a client's POV, this would additionally include the overhead from the network media etc.
Most performance models seem to isolate these separately to get a deeper sense of where the bottlenecks are. When there's just a single queue for everything then it makes sense to make the service-time as short as possible. But if you have multiple workload-based queues then you can do more interesting things.
I agree that measuring queuing delay and processing delay separately makes sense.
threadirqs: Force threading of all interrupt handlers except those marked explicitly IRQF_NO_THREAD.
It helped with my Bluetooth issues and it's recommended for low-latency audio setups, but unfortunately I lack the knowledge about the tradeoffs. You also probably need to assign a higher priority to the IRQ threads: https://alsa.opensrc.org/Rtirq - not sure if it's applicable besides audio.
You might also want to look into or write about DPDK, which achieves further speed-ups by using polling-mode drivers (instead of interrupts) and having the application process packets directly from the NIC (bypassing the kernel, which can be a bottleneck).
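A heavily abbreviated sketch of that polling model, loosely following DPDK's public skeleton/l2fwd examples; real code needs error checking, NIC binding, port configuration flags, and lcore management, and the port/queue sizes here are arbitrary:

    /* Busy-poll the NIC with rte_eth_rx_burst() instead of taking interrupts. */
    #include <stdint.h>
    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    int main(int argc, char **argv)
    {
        if (rte_eal_init(argc, argv) < 0)
            return -1;

        struct rte_mempool *pool = rte_pktmbuf_pool_create("mbufs", 8191, 256, 0,
            RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

        uint16_t port = 0;                 /* first DPDK-bound port */
        struct rte_eth_conf conf = {0};
        rte_eth_dev_configure(port, 1 /* rx queues */, 1 /* tx queues */, &conf);
        rte_eth_rx_queue_setup(port, 0, 1024, rte_eth_dev_socket_id(port), NULL, pool);
        rte_eth_tx_queue_setup(port, 0, 1024, rte_eth_dev_socket_id(port), NULL);
        rte_eth_dev_start(port);

        struct rte_mbuf *bufs[BURST_SIZE];
        for (;;) {                         /* poll loop: no interrupts, no syscalls */
            uint16_t n = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
            for (uint16_t i = 0; i < n; i++) {
                /* ... process payload via rte_pktmbuf_mtod(bufs[i], ...) ... */
                rte_pktmbuf_free(bufs[i]);
            }
        }
        return 0;
    }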
>"Hyper-threading (HT) or Simultaneous multithreading (SMT) is a technology to maximize processor resource usage for workloads with low instructions per cycle (IPC)"
I had actually not heard this before regarding SMT. What is the significance of low IPC with regard to the design of SMT? How does one determine whether their workload is a low-IPC workload?
There is probably buffer tuning you can do in the NIC driver also.
This begs the question: for which types of applications is it OK to disable mitigations?
1. Because it's completely airgapped from the bigger internet and you control everything on it. Think complex embedded systems like radar HW control on military ships. The ships I was sailing on had 6 full racks per radar just for things like track maintenance and missile up/downlink scheduling. At some complexity level it becomes worth it to "lift" apps out of the embedded domain and make use of the facilities a bigger OS like Linux provides, but you often still have fairly tight realtime and performance requirements. An HFT server connected only to an exchange server could also count.
2. You have adequate security measures in other parts of your setup that, after carefully evaluating the risks, you decide to forego defense-in-depth in this part of the system.
There are not all that many fields where this type of microsecond chasing is all that worthwhile, though. There are significant costs and risks involved, and most web users won't ever notice a page load increase of a few microseconds. There are way more cost-effective performance improvements available for 99+% of companies out there than CPU pinning and IRQ isolation.
1. They're running only trusted code.
2. L1 cache attacks aren't relevant if there's only one thread ever running on a given core.
3. Kernel-bypass networking means there are no system calls in the hot path anyways, so the OS mitigations won't even run in the first place.
If you're already doing all this it may be easier/better to look at using FPGAs instead. The advantage of the software approach, though, is that you don't need to procure a card with enough LUTs to house your design, and it allows the ops team to contribute to performance.
But not everybody is so serious, and ease of deployment often matters. CentOS 7 is very old now. I hope you are not still using 6.
And not building your own is less about ease of deployment (you could rebuild packages and add your own repo).
Building your own disconnects validation and maintenance.
So maybe you don't build your own because you are serious.
Persuading customers to run your special kernel is harder than getting them to run your program, so stuff that works with a stock kernel is an easier sell. Getting them to insmod your custom module is often possible when deploying a custom kernel isn't.
* Intel Clear Linux OS