If in the future (TM) using CPU performance counters for every process becomes really "free" (as in "gratis" or "cheap"), the OS could report badly performing processes for the reasons exposed in the article (low IPC indicating poor memory access patterns, unoptimized code, code using too-small buffers for I/O and thereby degrading system performance through excessive kernel processing time, etc.), showing the user that despite having high CPU usage, the CPU is not getting enough work done (in that sense I could agree with the article).
Your point that we can't utilize those cycles, so therefore they are utilized... Well, I have a production instance running with an IPC of 0.11 (yes, seriously), and system-wide %CPU of 60, and idle hyperthreads. If I added a high IPC workload to those idle hyperthreads, do you think I could recover those stalled cycles? Probably. So were those cycles utilized or not? :-)
As for using CPU counters for free: sure, they basically are. Just instructions to read & configure (WRMSR, RDMSR, RDPMC), and the OS can do it on kernel context switch, so that they can be associated per-process. It'd cost very very little: adding a few instructions to each context switch, and some members to task_struct. I'd be a bit annoyed if the kernel started doing this everywhere as it'd tie up some PMCs (physical resource), so I'd rather that behavior be configurable.
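(For anyone who wants to play with this: the kernel already supports exactly that kind of per-process counting through perf_event_open(2); open the events for a pid and the counters are saved and restored across context switches for you. A minimal counting-mode sketch in C - error handling omitted, and reading another process may require privileges or a lower perf_event_paranoid setting:)

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static int open_counter(uint64_t config, pid_t pid)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = config;
        attr.disabled = 1;
        attr.exclude_kernel = 1;   /* user-space cycles/instructions only */
        return (int)syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
    }

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }
        pid_t pid = (pid_t)atoi(argv[1]);
        int cyc = open_counter(PERF_COUNT_HW_CPU_CYCLES, pid);
        int ins = open_counter(PERF_COUNT_HW_INSTRUCTIONS, pid);

        ioctl(cyc, PERF_EVENT_IOC_RESET, 0);
        ioctl(ins, PERF_EVENT_IOC_RESET, 0);
        ioctl(cyc, PERF_EVENT_IOC_ENABLE, 0);
        ioctl(ins, PERF_EVENT_IOC_ENABLE, 0);

        sleep(1);                               /* measure for one second */

        ioctl(cyc, PERF_EVENT_IOC_DISABLE, 0);
        ioctl(ins, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t cycles = 0, instructions = 0;
        read(cyc, &cycles, sizeof(cycles));
        read(ins, &instructions, sizeof(instructions));
        printf("IPC over that second: %.2f\n",
               (double)instructions / (double)cycles);
        return 0;
    }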
RDPMC is fast. WRMSR is amazingly slow.
Put another way, when you're using WRMSR, your IPC (CPU usage?) might be around .002.
Great and timely article, by the way, though it disappointingly lacks a video of you yelling at CPUs.
I'll back Brendan and claim that using hardware CPU counters on modern server processors is basically free. They are internal to the processor, and essentially don't utilize any resources shared with program execution. Maybe you could contrive a case where they cause faster thermal throttling, but I haven't seen it. For a simple count of events that is started and stopped at context switches (and when you are counting fewer events than available hardware counters) I'd be surprised if you even saw 1% overhead.
If you are trying to multiplex many more events than you have physical counters, the overhead of "reprogramming" the counters might cost you single-digit percentage overhead. If you are trying to save all stack traces, record all branches, or monitor all memory accesses, you can manage to create a more significant slowdown, but even then you can usually lower it to something acceptable by sampling.
So yes --- you probably should be using perf far more! Or VTune, or likwid (https://github.com/RRZE-HPC/likwid).
exactly what i was going to tell you.
stick with that next time and stay away from 'wrong'.
On the other hand, CPU utilisation is used to scale CPU frequency on some systems. This generally works OK, but if a workload is spending most of its time blocked on the memory subsystem, then scaling up the CPU frequency might (depending on the HW architecture) not actually speed it up at all, just waste power. In that case the perf counters mentioned in the article might be a better metric for frequency scaling. I think Intel CPUs usually do this already, but I'm not exactly sure.
So yeah basically the utilisation metric isn't "wrong", you just have to understand what it actually means before using it.
I can improve IPC of almost any algorithm (assuming it is not already very high) by slipping lots of useless or nearly useless cheap integer operations into the code.
People always tell you "branch misses are bad" and "cache misses are bad". You should always ask: "compared to what?" If it was going to take you 20 cycles' worth of frenzied, 4-instructions-per-clock work to calculate something that you could instead keep in a big table in L2 (assuming you aren't contending for it), you might be better off eating the cache miss.
Similarly, you could "improve" your IPC by avoiding branch misses (assuming no side effects): calculate both sides of an unpredictable branch and select one with CMOV. This will save you branch misses and increase your IPC, but it may not improve the speed of your code (if the cost of the extra work is bigger than the cost of the branch misses).
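A toy sketch of that trade-off in C (not from the article; whether the compiler actually emits a cmov depends on the compiler and target):

    /* Branchy version: on random data the branch is unpredictable, so each
     * miss costs on the order of 15-20 cycles of flushed work. */
    int sum_branchy(const int *a, int n, int threshold)
    {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            if (a[i] > threshold)
                sum += a[i];
        }
        return sum;
    }

    /* Branchless version: both sides are always computed and one is selected,
     * which compilers typically lower to a cmov or a mask. Branch misses drop
     * and IPC rises, but if the per-element work is cheap this isn't
     * necessarily faster. */
    int sum_branchless(const int *a, int n, int threshold)
    {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            int add = (a[i] > threshold) ? a[i] : 0;
            sum += add;
        }
        return sum;
    }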
You can reach near-full CPU utilization when processing sequential data (read/process/store). When using trees, hashes, etc. on "big enough data", it is very hard to get good IPC, even with 99% L1 cache hits, e.g. because of code where the branch predictor doesn't have enough information, or when many CPUs are accessing data in the same region of memory, etc.
Also, depending on what algorithm you're running, you'll never reach the maximum IPC even with optimal code; e.g. a hash digest doesn't use the store unit very much, so in the above CPU example you would reach an IPC of 3 with optimal code. So not reaching the maximum IPC doesn't always mean you're under-using the CPU for a given algorithm.
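A classic illustration of the tree/hash point (a sketch, not from the parent): a linked-list or tree walk is a chain of dependent loads, so the core mostly waits on the previous load's latency rather than being limited by execution width:

    #include <stddef.h>

    struct node { struct node *next; long payload; };

    /* Each n->next load depends on the previous one, so the out-of-order
     * engine cannot overlap them. Even with L1 hits the loop is bounded by
     * load-to-use latency, and IPC collapses further once nodes spill to
     * L2/L3/DRAM. */
    long chase(const struct node *head)
    {
        long sum = 0;
        for (const struct node *n = head; n != NULL; n = n->next)
            sum += n->payload;
        return sum;
    }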
perf on a binary that is properly instrumented (so it can show you per-source-line or per-instruction data) is really great.
I recall some time back that Facebook specifically went with the Xeon D processor for exactly this reason. Since Xeon D is single socket, it avoids NUMA-type issues.
We didn't even know there was a performance problem there -- we just wanted to make the program faster, so we ran perf, visualized IPC, and sorted by the routines with the lowest IPC (actually, we used the reciprocal, CPI, which I find a bit more intuitive). It sort of just provided a bright, blinking sign pointing right at the location causing a huge problem. Once that was solved, you could just pick the next item on the list as the next thing to optimize :)
Enabling it allows you to see a clearer picture of not just stalls but also CPU steal from "noisy neighbors" -- guests also assigned to the same host.
I've seen CPU steal cause kernel warnings of "soft-lockups". I've also seen zombie processes occur. I suspect they're related but it's only anecdotal: I'm not sure how to investigate.
It's pretty amazing what kind of patterns you can identify when you've got stuff like that running. Machine seems to be non-responsive? Open up htop, see lots of grey... okay so since all data is on the network, that means that it's a data bottleneck; over the network means it could be bottlenecked at network bandwidth or the back-end SAN could be bottlenecked.
Fun fact: Windows Server doesn't like having its disk IO go unserviced for minutes at a time. It's not fun to have another team come over and get angry because you're bluescreening their production boxes.
htop should add PMC support, and a lot more things too. I might even start using it then. :)
One of my favorite school assignments was being given an intentionally bad implementation of the Game of Life, compiled with -O3, and trying to get it to run faster without changing compiler flags. It's sort of mind-boggling how fast computers can do stuff if you can reduce the problem to fixed-stride for loops over arrays that can be fully pipelined.
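A sketch of the kind of inner loop that ends up being fast (hypothetical layout: a flat array of 0/1 cells with the row width W as a compile-time constant, so every neighbour offset is a fixed stride the compiler can pipeline and vectorise):

    #define W 1024   /* row width, including a 1-cell border */

    static inline int next_state(const unsigned char *g, int i)
    {
        int n = g[i - W - 1] + g[i - W] + g[i - W + 1]
              + g[i - 1]                + g[i + 1]
              + g[i + W - 1] + g[i + W] + g[i + W + 1];
        return (n == 3) || (n == 2 && g[i]);   /* Conway's rules */
    }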
1) Emacs prelude w/ helm enabled everywhere - https://github.com/bbatsov/prelude
2) Irony-mode for auto completion. Since you're using CMake, you can just add -DCMAKE_EXPORT_COMPILE_COMMANDS=ON to your invocation to output the metadata needed for completion to work. - https://github.com/Sarcasm/irony-mode
3) helm-flycheck for syntax checking
4) For jump to definition I generally just use helm-ag (grep); I never actually got this working to my satisfaction despite trying a few things like rtags (slow, unstable) and clang-ctags (promising but annoying to set up at the time).
5) For formatting I usually just enable clang-format on save w/ something like this (requires clang-format mode installed):
    ;; e.g. for C++ buffers:
    (add-hook 'c++-mode-hook
              (lambda () (add-hook 'before-save-hook #'clang-format-buffer nil t)))
There are many other optimisation strategies, with different trade-offs.
Very true that 100% CPU utilization often means waiting on bus traffic (loading caches, loading RAM, loading instructions, decoding instructions); only rarely is the CPU _doing_ useful work.
Whether this counts as useful work depends on the context of what you are measuring. The initial access of a buffer almost universally stalls (unless you prefetched it 100+ instructions ago). But starting to stream that data into L1 is useful work.
Getting anywhere near peak IPC is _beyond_ difficult even for simple algorithms and critical hot-path functions. You not only need assembler cooperation (to ensure decoder alignment), but you need to know _what_ processor you are running on to know the constraints of its decoder, uOP cache, and uOP cache alignment.
Perf gives you the ability to capture per-PID counters. Generally just look at cycles elapsed vs. instructions retired.
This gives you a general overview of stalls. Once you dig into IPC, front-end stalls, and back-end stalls, you start to see the turtles.
We honestly need a tool that compares I/O, memory fetch, cache-miss, TLB misses, page-outs, CPU Usage, interrupts, context-swaps, etc all in one place.
> The first three fields in this file are load average figures giving the number of jobs in the run queue (state R) or waiting for disk I/O (state D) averaged over 1, 5, and 15 minutes.
Nobody knows about the "or waiting for disk I/O (state D)" bit. So a bunch of processes doing disk I/O can cause loadavg spikes, but there can still be plenty of spare CPU.
As far as I understand it, the metric works as follows: At every clock interrupt (every 4ms on my machine) the system checks which process is currently running, before invoking the scheduler:
- If the idle process is running, the time is accounted as idle.
- Otherwise the processor is regarded as utilized.
(This is what I got from reading the docs, and digging into the source code. I am not 100% confident I understand this completely at this point. If you know better please tell me!)
There are many problems with this approach:
Every time slice (4 ms) is accounted either as completely utilized or as completely free. There are many reasons for processes going on-CPU or off-CPU outside of clock interrupts; blocking syscalls are the most obvious one.
In the end a time slice might be used by multiple different processes and interrupt handlers, but if at the very end of the time slice the idle thread happens to be on CPU, the whole slice is counted as idle time!
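As a sketch, my mental model of the tick accounting looks roughly like this (hypothetical names; the real kernel code lives in the scheduler's cputime accounting, and newer kernels can do finer-grained NO_HZ accounting):

    #include <stdint.h>

    #define TICK_NSEC 4000000ULL            /* 4 ms tick, i.e. HZ = 250 */

    struct task { uint64_t utime, stime; }; /* per-task CPU time, in ns */
    struct cpustat { uint64_t user, system, idle; } cpustat;

    struct task *idle_task;                 /* the per-CPU idle task */

    /* Called from the timer interrupt, once per tick. Whatever task happens
     * to be on the CPU at that instant is charged the whole tick - which is
     * exactly the inaccuracy described above. */
    void account_tick(struct task *current, int user_mode)
    {
        if (current == idle_task) {
            cpustat.idle += TICK_NSEC;
        } else if (user_mode) {
            current->utime += TICK_NSEC;
            cpustat.user += TICK_NSEC;
        } else {
            current->stime += TICK_NSEC;
            cpustat.system += TICK_NSEC;
        }
    }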
The other metric you can mix with execution time is energy efficiency. That's about it. IPC is not a very good proxy. Fun to look at, but likely to be highly misleading.
When writing optimized code, the question "should I optimize for memory or for computation?" comes up often. Should I cache results? Should I use a more complex data structure in order to save memory or improve locality?
IPC is a good indicator of how you should tackle the problem. High IPC means you may be doing too many calculations, while low IPC means you should look at your memory usage. BTW, most of the time, memory is the problem.
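A toy example of that trade-off (a sketch, not from the parent): counting set bits either by computing them every time (more instructions, typically high IPC, no memory traffic) or by looking them up in a precomputed table (fewer instructions, but now you depend on the cache behaving):

    #include <stdint.h>

    /* Compute-heavy variant: a dozen cheap ALU ops per call, no loads. */
    static inline int popcount_calc(uint8_t x)
    {
        x = (x & 0x55) + ((x >> 1) & 0x55);
        x = (x & 0x33) + ((x >> 2) & 0x33);
        return (x & 0x0F) + (x >> 4);
    }

    /* Memory-heavy variant: essentially one load per call. Great while the
     * 256-byte table stays hot in L1; a stall generator if the table were
     * much larger or the cache is already full of other data. */
    static uint8_t popcount_table[256];   /* fill once using popcount_calc() */

    static inline int popcount_lookup(uint8_t x)
    {
        return popcount_table[x];
    }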
What does IPC tell me about where my code could/should be async so that it's not stalled waiting for IO? Is combined IO rate a useful metric for this?
There's an interesting "Cost per GFLOPs" table here:
Btw these are great, thanks:
( I still couldn't fill this out if I tried: http://www.brendangregg.com/blog/2014-08-23/linux-perf-tools... )
This will let us find the false sharing cost (cache contention etc).
Now I wonder how easy, or how much manual work, it would be to do these combined flame graphs with CPI/IPC information? My cursory search found nary a mention after 2015... Perhaps this is still hard and complicated.
To me it seems really useful to know why a function takes so long (waiting or calculating), and not "merely" how long it takes. Even if the information is not perfectly reliable, nor can it be measured without affecting execution.
The '3' in the command line is just how long the stats are averaged over; I'd recommend using 3+ to average out bursts of activity on a fairly steady-state system.
Only when I'm trying to get better code, do I need to care about IPC, and cache stalls. I may also want better code to improve the overall speed of execution, too.
 (~50% if you have a pair of redundant machines and load scales nicely, maybe 20% idle or even less if you have a large number of redundant machines and the load balances easily over them)
Correct me if I am wrong, but this won't work for spinlocks in busy loops: you do have a lot of instructions being executed, but the whole point of the loop is to wait for the cache to synchronize, and as such, this should be taken as "stalled".
See under "Other reasons CPU Utilization is misleading"
> Spin locks: the CPU is utilized, and has high IPC, but the app is not making logical forward progress.
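For a concrete picture, a sketch of such a loop (C11 atomics; not from the article): while it spins, the thread shows 100% CPU and can even show healthy IPC, yet it makes no logical forward progress.

    #include <stdatomic.h>

    typedef atomic_int spinlock_t;

    void spin_lock(spinlock_t *lock)
    {
        for (;;) {
            if (!atomic_exchange_explicit(lock, 1, memory_order_acquire))
                return;                                  /* acquired */
            /* Busy-wait on a plain load: this inner loop hits the local
             * cache and retires instructions at full speed while doing
             * nothing useful. */
            while (atomic_load_explicit(lock, memory_order_relaxed))
                ;
        }
    }

    void spin_unlock(spinlock_t *lock)
    {
        atomic_store_explicit(lock, 0, memory_order_release);
    }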
Let me help.
Look, CPU utilization is misleading. Did you forget to use -O2 when compiling your code? Oops, CPU utilization is now including all sorts of wasteful instructions that don't make forward progress, including pointless moves of dead data into registers.
Are you using Python or Perl? CPU utilization is misleading; it's counting all that time spent on bookkeeping code in the interpreter, not actually performing your logic.
CPU utilization also measures all that waste when nothing is happening, when arguments are being prepared for a library function. Your program has already stalled, but the library function hasn't started executing yet for the silly reason that the arguments aren't ready because the CPU is fumbling around with them.
Boy, what a useless measure.
The article makes the perfectly-reasonable observation that "CPU usage" also includes the time spent waiting on memory, that this time (as a %) has increased significantly in modern processors, and that visibility into stall time is potentially actionable to increase efficiency. It's useful to think about the continued relevance of performance metrics which were introduced in the 1970s and have continued basically unchanged into 2017. The article does not claim that "CPU usage", as a metric, is useless. Your comment is satirising a strawman.
If I have two programs which scan through a 100 GB file to do the same processing, and both take the same amount of time (the task is I/O bound), which is more efficient?
Let's see: one uses 70% CPU, the other 5%: clear.
How is that wrong; where are we misled?
This is simply telling us that although the elapsed wall time is the same, one program is consuming more processor time. We could look at the processor time instead of the percentage: that's just a rearrangement of the same figures. That 70% is just the 0.7 factor between processor time and real time.
It is the article that is fighting a strawman because nobody uses raw CPU time or utilization to measure how well a program execution is using the internal resources of the processor.
I have never seen anyone claim, based only on processor time alone and nothing else, that a program which takes longer to do the same thing as another one is due to a specific cause, like caches being cold, or a poorer algorithm. That person would be misled and wrong, not CPU time per se.
Instructions per cycle can also be misleading. Modern CPUs can do multiple shifts per cycle, but something like division takes a long time.
It all doesn't matter anyway, as instructions per cycle does not tell you anything specific. Use the CPU's built-in performance counters; use perf. It basically works by sampling every once in a while. It (perf, or any other tool that uses performance counters) shows you exactly which instructions are taking up your process's time. (Hint: it's usually the ones that read data from memory, so be nice to your caches.)
It's not rocket surgery.
A) it can run another hyperthread with those cycles, and,
B) even without hyperthreads, it misleads the developers as to where to tune.
C) also, "busy waiting" sounds like an oxymoron. DRAM is busy. The CPU is waiting. Yes, the OS doesn't differentiate and calls it %CPU.
> It all doesn't matter anyway, as instructions per cycle does not tell you anything specific.
It provides a clue as to whether %CPU is instruction bound or stall bound.
> Use the cpu-builtin performance counters, use perf. It basically works by sampling every once in a while.
That's called sampling mode. The example in my post used counting mode.
Edit: added (C).
(terminology nitpick: HyperThreading is just Intel's proprietary trademark for SMT)
I don't see hyperthreading as useful. Neither do the benchmarks, which usually put hyperthreading off ahead of hyperthreading on.
The reasoning is simple. When a cache miss happens, the (Intel) CPU runs another thread. The question is whether that happens often enough to justify the overhead of that "hyper"threading. Remember that hyperthreading does not magically create a new CPU core, so there has to be overhead (and the benchmarks show as much). Even if there were no overhead, there is still the other thread filling the L1 (and L2) cache (programs that rarely read from or write to memory are very rare). Apparently Agner Fog wrote a post about it.
Even without taking into account what I wrote above, hyperthreading would not have a big enough influence on the sampling (I can draw a hypothetical timeline if you don't get it).
>B) even without hyperthreads, it misleads the developers as to where to tune.
No, no it doesn't. How would it? It shows you exactly which instruction is taking up time. If it is a MOV then your data is not in cache; if it is a divss then... well, you have a division in your code; if it is spread about evenly in a pattern then you should look at whether your code is serial (not that there is usually much help for that); if it is a branch then... etc. How the F is that less helpful than a blanket "your code is not executing many instructions per tick" statement? Instructions per tick only tells you if another process is messing with yours (probably by hogging memory bandwidth).
>C) also, "busy waiting" sounds like an oxymoron. DRAM is busy. The CPU is waiting. Yes, the OS doesn't differentiate and calls it %CPU.
Normally it is, but we are not writing poems here. Remember how old CPUs had a busy loop that ran when idling? AFAIK even today's CPUs have one.
>It provides a clue as to whether %CPU is instruction bound or stall bound.
If I read a byte from a random place in a gigabyte array, then do some shifts on it and store it back, it will still show a (relatively) high number of instructions, but the bottleneck will be memory. On the other hand, if I load a float from a small array and then do some math on it (notably division), it will result in a low number of instructions per tick, but memory will not be the bottleneck.
So all in all, instructions per tick doesn't tell you much. Maybe it helps if the program suddenly starts to go slow, to more easily check whether some other program is hogging memory bandwidth. Or as a general rule of thumb that makes you think "this code should be executing instructions much faster, hmm".
As for finding bottlenecks and dealing with them:
If your MOV instruction is a bottleneck, you are either dealing with a huge amount of data that can't be accessed in a predictable manner, or you didn't think to make your code/data structures cache-friendly, or your code is really complicated or just shit (as 90% of code is).
Perf also shows you cache misses...
Another great thing about perf (actually about objdump) is that it can interleave source code and assembly when showing you where your code is slow. So you don't even have to learn assembly.
I have a huge 2D array representing a grid. The grid will be sampled 1-5 points up, down, left, and right. The up and down sampling will, more often than not, result in cache misses (for obvious reasons). Solution? Cut the grid into squares (see the sketch below).
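Roughly, the blocked version looks like this (a sketch; process_cell is a hypothetical stand-in for whatever per-cell sampling is being done):

    /* Hypothetical per-cell work that samples neighbours within +/- 5 rows/columns. */
    void process_cell(float *grid, int i, int j, int w);

    enum { BLOCK = 64 };   /* tune so a block's working set fits in L1/L2 */

    /* Blocked ("tiled") traversal of an h x w grid: work on one BLOCK x BLOCK
     * square at a time, so a cell's up/down neighbours are still in cache when
     * sampled instead of being a whole row (w elements) away. */
    void process_blocked(float *grid, int h, int w)
    {
        for (int bi = 0; bi < h; bi += BLOCK)
            for (int bj = 0; bj < w; bj += BLOCK)
                for (int i = bi; i < bi + BLOCK && i < h; i++)
                    for (int j = bj; j < bj + BLOCK && j < w; j++)
                        process_cell(grid, i, j, w);
    }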
PS Hyperthreading sucks. I always turn it off.
edit: PPS No benchmarks, hypotheticals, deeper research into the causes?
True, but useful? Most of us are busy trying to get writes across a networked service. Indeed, getting to 50% utilization is often a dangerous place.
For reference, running your car by focusing on engine RPM is silly. But it is a very good proxy, and it's even sillier to try to avoid it. Only if you are seriously instrumented is that a valid path, and getting that instrumented is not cheap or free.
Can't some people read an article without extrapolating to the end of the known universe, and just stick to the article's actual narrow subject?
It is a neat subject. Just as knowing different ways of measuring car utilization and power transfer. Framing it as a take down of how people commonly do it is extreme and at major risk of throwing out the baby with the bath water.
I guess this is why he used the term "likely".
Interesting article, though.
In-order CPUs are much easier to reason about; you can literally count the stalled cycles.
depends on the workload.
And that is per core; it becomes increasingly meaningless on a dual-core, and on a quad-core and above you might as well replace it with MS Clippy.
And this is before discussing what that percentage really means.
edit: I'm interpreting the downvotes as people being in denial about this ;)
As far as hyperthreading goes, not understanding something is a bad reason to dislike it. For most developer purposes, pretend it is another CPU until you learn enough about profiling multithreaded apps that you need to care about the difference.
But it really bothers me that it gets much harder to assess both the total load and the load produced by a single process. I haven't found any useful info on how to deal with this in simple overview user-level tools such as top or the Windows Task Manager.
What are the disadvantages?
CPU time accounting measurements (sys/usr/idle) as reported by standard tools do not reflect the side-effects of resource sharing between hardware threads.
It is impossible to correctly measure idle and extrapolate available computing resources.