CPU Utilization is Wrong (brendangregg.com)
624 points by dmit 138 days ago | 91 comments

I respect Brendan, and although it is an interesting article, I have to disagree with him: the OS tells you about OS CPU utilization, not CPU micro-architecture functional unit utilization. So if the OS uses a CPU for running code until a physical interrupt or a software trap happens, in that period the CPU has been doing work. Unless the CPU were able to do a "free" context switch to work whose data is already cached, instead of waiting on e.g. a cache miss (hint: SMT/"hyperthreading" was invented exactly for that use case), the CPU is actually busy.

If in the future (TM) using CPU performance counters for every process becomes really "free" (as in "gratis" or "cheap"), the OS could report badly performing processes for the reasons given in the article (low IPC indicating poor memory access patterns, unoptimized code, code using too-small buffers for I/O, which degrades system performance through excessive kernel processing time, etc.), showing the user that despite high CPU usage, the CPU is not getting enough work done (in that sense I could agree with the article).

Yes, I understand this argument. %CPU, as we know it, tells you OS CPU utilization (what I called non-idle time), and you can look at the kernel code and how it is calculated and conclude that it's not wrong. It is what it is. It's non-idle time.
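To make "non-idle time" concrete, here is a minimal sketch of how a %CPU figure can be derived from two samples of the aggregate `cpu` line in /proc/stat. The sample values are invented, and counting iowait as idle is a choice made for this sketch; tools differ on that.

```python
def parse_cpu_line(line):
    """Return the numeric fields of an aggregate 'cpu' line from /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(v) for v in fields[1:]]


def non_idle_percent(before, after):
    """Percent of elapsed ticks between two samples that were not idle."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    deltas = [y - x for x, y in zip(a, b)]
    # First eight fields: user nice system idle iowait irq softirq steal.
    # The guest fields that follow are already included in user/nice.
    total = sum(deltas[:8])
    idle = deltas[3] + deltas[4]  # idle + iowait (an assumption of this sketch)
    return 100.0 * (total - idle) / total if total else 0.0


# Two made-up samples, in clock ticks.
t0 = "cpu 1000 0 500 8000 100 0 0 0 0 0"
t1 = "cpu 1600 0 700 8150 150 0 0 0 0 0"
print(non_idle_percent(t0, t1))
```

Nothing in this calculation knows whether the non-idle ticks were spent retiring instructions or stalled on memory, which is the gap the article is pointing at.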

Your point that we can't utilize those cycles, so therefore they are utilized... Well, I have a production instance running with an IPC of 0.11 (yes, seriously), and system-wide %CPU of 60, and idle hyperthreads. If I added a high IPC workload to those idle hyperthreads, do you think I could recover those stalled cycles? Probably. So were those cycles utilized or not? :-)

As for using CPU counters for free: sure, they basically are. Just instructions to read & configure (WRMSR, RDMSR, RDPMC), and the OS can do it on kernel context switch, so that they can be associated per-process. It'd cost very very little: adding a few instructions to each context switch, and some members to task_struct. I'd be a bit annoyed if the kernel started doing this everywhere as it'd tie up some PMCs (physical resource), so I'd rather that behavior be configurable.
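The per-process bookkeeping described here can be sketched as a toy model: read a free-running counter at every context switch and charge the delta to the outgoing task. The class and method names below are hypothetical, the counter values are invented, and this is plain Python standing in for kernel code, not a description of what any kernel does.

```python
from collections import defaultdict


class CounterAccounting:
    """Attribute a free-running hardware counter to tasks at context switches."""

    def __init__(self):
        # Per-task running totals (the "members in task_struct" of the comment).
        self.totals = defaultdict(int)
        self.current = None
        self.start = 0

    def context_switch(self, next_task, counter_now):
        # Charge everything counted since the last switch to the outgoing task.
        if self.current is not None:
            self.totals[self.current] += counter_now - self.start
        self.current = next_task
        self.start = counter_now


acct = CounterAccounting()
# (task switched in, counter value read at that moment); values are invented.
for task, counter in [("A", 0), ("B", 1000), ("A", 1500), (None, 4000)]:
    acct.context_switch(task, counter)
print(dict(acct.totals))
```

The sketch only reads the counter at each switch; it never reprograms it, which is the cheaper of the two operations discussed in this thread.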

> It'd cost very very little: adding a few instructions to each context switch, and some members to task_struct.

RDPMC is fast. WRMSR is amazingly slow.

Put another way, when you're using WRMSR, your IPC (CPU usage?) might be around .002.

It is not cheap enough, yet. Rationale: although it could be easy to account user processes, it would not be that easy for the OS kernel and kernel components (e.g. figure out how to account cache pollution because of hardware IRQ dispatching and devices doing DMA...). So, except for, maybe, micro-kernel OS's, and adding performance counters to every single device capable of bus-mastering, it is far from being a reality, in my opinion.

It'd cost a little more than that, wouldn't it? I thought once the counters were configured there was a (potentially tiny?) overhead to any type of forward progress of RIP. You know way more about this than me, so glad to learn something if I'm wrong, and if you say it barely exists I'll be using far more of perf and other things like it... Maybe it's negligible enough that the data is worth the trade, anyway.

Great and timely article, by the way, though it disappointingly lacks a video of you yelling at CPUs.

if you say it barely exists I'll be using far more of perf and other things like it...

I'll back Brendan and claim that using hardware CPU counters on modern server processors is basically free. They are internal to the processor, and essentially don't utilize any resources shared with program execution. Maybe you could contrive a case where they cause faster thermal throttling, but I haven't seen it. For a simple count of events that is started and stopped at context switches (and when you are counting fewer events than available hardware counters) I'd be surprised if you even saw 1% overhead.

If you are trying to multiplex many more events than you have physical counters, the overhead of "reprogramming" the counters might cost you single digit overhead. If you are trying to save all stack traces, record all branches, or monitor all memory accesses you can manage to create a more significant slowdown, but even then you can usually lower this to something acceptable by sampling.

So yes --- you probably should be using perf far more! Or VTune, or likwid (https://github.com/RRZE-HPC/likwid).

> it's not wrong. It is what it is. It's non-idle time.

exactly what i was going to tell you.

stick with that next time and stay away from 'wrong'.

And %CPU reports busy when an _external_ component, DRAM, is actually busy, and CPU is waiting. So it's not "CPU utilized". That's wrong.

Maybe the point from a user's perspective would be yours and from a developer's perspective would be his. Of course once the binary has been shipped to the user not much can be done there (except for things like hyperthreading as you say). However, a developer should not see 100% cpu as an indication that it can't be done faster and should realize that better algorithms and/or better memory access patterns can cause more work to be done even if both show 100% cpu. Of course I would argue that this should be obvious to all but novice programmers who worry about performance, but his article does paint the picture in a pretty clear way and will maybe help more realize what's really going on here.

I'm a programmer. And I use the CPU performance counters for optimizing code, so go figure :-)

I came here to say this but since you've already said it, I'll also provide an on-the-other-hand!

On the other hand, CPU utilisation is used to scale CPU frequency on some systems. This generally works OK but if a workload is spending most of its time blocking on the memory subsystem then scaling up the CPU frequency might (depending on the HW architecture) not actually speed it up at all, but just waste power. In that case the perf counters mentioned in the article might be a better metric for freq scaling. I think Intel CPUs usually do this already but not exactly sure.

So yeah basically the utilisation metric isn't "wrong", you just have to understand what it actually means before using it.

The problem is that IPC is also a crude metric. Even leaving aside fundamental algorithmic differences, an implementation of some algorithm with IPC of 0.5 is not necessarily slower than an implementation that somehow manages to hit every execution port and deliver an IPC of 4.

I can improve IPC of almost any algorithm (assuming it is not already very high) by slipping lots of useless or nearly useless cheap integer operations into the code.

People always tell you "branch misses are bad" and "cache misses are bad". You should always ask: "compared to what"? If it was going to take you 20 cycles worth of frenzied, 4 instructions per clock, work to calculate something you could keep in a big table in L2 (assuming that you aren't contending for it) you might be better off eating the cache miss.

Similarly you could "improve" your IPC by avoiding branch misses (assuming no side effects) by calculating both sides of an unpredictable branch and using CMOV. This will save you branch misses and increase your IPC, but it may not improve the speed of your code (if the cost of the work is bigger than the cost of the branch misses).
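A small worked example of why a higher IPC need not mean faster code. It assumes run time is simply instructions / IPC / clock rate; the instruction counts, IPC values, and 3 GHz clock are all invented for illustration.

```python
def run_time_seconds(instructions, ipc, clock_hz):
    """Wall time for a CPU-bound loop: cycles = instructions / IPC."""
    return instructions / ipc / clock_hz


CLOCK_HZ = 3e9  # assumed 3 GHz clock

# Version 1: 1e9 useful instructions at IPC 0.5.
lean = run_time_seconds(1e9, 0.5, CLOCK_HZ)

# Version 2: the same useful work plus 3e9 cheap filler instructions that
# execute in the shadow of the stalls, so the cycle count is unchanged
# while the reported IPC rises to 2.0.
padded = run_time_seconds(4e9, 2.0, CLOCK_HZ)

print(lean, padded)
```

Both versions take the same number of cycles under these assumptions, so the fourfold "improvement" in IPC buys no wall-clock time at all.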

A sustained IPC of 2 is already quite high (e.g. for an OoOE CPU having 2 load/store units, and 2 [vector-]ALUs, i.e. having a maximum IPC of 4).

You can reach near-full CPU utilization when processing sequential data (read/process/store). When using trees, hashes, etc. on "big enough data", it is very hard to have good CPU IPC (even when having 99% L1 cache hits, e.g. because of code where the branch predictor does not have enough information, when many CPUs are accessing data in the same zone of memory, etc.).

You should say "ALU utilization" when that's what you mean.

Full CPU utilization regarding IPC is not just about the ALUs, it includes all units (ALU units, load/store, and other, e.g. if you have differentiated integer and floating point ALUs, etc.). E.g. with a 4 maximum IPC CPU (2 load/store, 2 ALU), on a synthetic benchmark doing only register-register ALU optimal work, you would get an IPC of 2.

Also, depending on what algorithm you're doing, you'll never get the maximum IPC, even with optimal code, e.g. for a hash digest you don't need the store unit very much, so in the above CPU example you would reach an IPC of 3 with optimal code. So falling short of the maximum IPC does not always mean you're under-using the CPU for a given algorithm.
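The unit-count reasoning above can be written as a simple bound: each kind of unit caps IPC at (number of units) / (fraction of instructions needing that kind). This is a rough sketch, assuming every unit accepts one instruction per cycle and, to reproduce the IPC of 3 quoted for the hash digest, that the "2 load/store" units are one load unit and one store unit. The instruction mixes are invented.

```python
def max_ipc(units, mix, issue_width):
    """Upper bound on IPC for an instruction mix on a machine with the given units.

    units: {kind: number of units}, each assumed to take one instruction per cycle.
    mix:   {kind: fraction of instructions of that kind}, summing to 1.
    """
    bound = issue_width
    for kind, fraction in mix.items():
        if fraction > 0:
            bound = min(bound, units[kind] / fraction)
    return bound


# Assumption: one load unit and one store unit, plus two ALUs.
UNITS = {"alu": 2, "load": 1, "store": 1}

alu_only = max_ipc(UNITS, {"alu": 1.0}, 4)                     # register-register work
hash_like = max_ipc(UNITS, {"alu": 2 / 3, "load": 1 / 3}, 4)   # loads but no stores
balanced = max_ipc(UNITS, {"alu": 0.5, "load": 0.25, "store": 0.25}, 4)
print(alu_only, hash_like, balanced)
```

Under these assumptions the three mixes are bounded at roughly 2, 3, and 4 instructions per cycle, matching the figures in the comments above.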

If I understand correctly, the point is not raising IPC necessarily, it's finding what to optimize. If you have low IPC, optimize for memory access. If you have high IPC, optimize for code execution. This is in the article under "Interpretation and actionable items". In the end, what you want to improve is the wall-clock time of your program (or benchmarks, realistically). Slipping in useless operations is not going to do this.
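As a rough illustration of that triage, IPC is just retired instructions divided by cycles, and the result points at where to look first. The 1.0 cut-off is an assumed rule of thumb, not a hard limit, and the counts are invented so that they reproduce the 0.11 figure mentioned earlier in the thread.

```python
def ipc(instructions, cycles):
    """Instructions per cycle from two counter readings."""
    return instructions / cycles


def where_to_look(ipc_value, threshold=1.0):
    """Rough triage; the default threshold is an assumption, tune it per CPU."""
    if ipc_value < threshold:
        return "likely stalled: look at memory access patterns first"
    return "likely instruction bound: look at reducing code execution first"


sample = ipc(instructions=11_000_000, cycles=100_000_000)  # invented counts
print(round(sample, 2), where_to_look(sample))
```

Either answer is only a starting point; the goal stays wall-clock time, not the IPC number itself.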

IPC is amazing. We had some "slow" code, did a little profiling, and found that a hash lookup function was showing very low IPC about half the time. Turns out, the hash table was mapped across two memory domains on the server (NUMA) and the memory lookup from one processor to the other processor's memory was significantly slower.

perf on a binary that is properly instrumented (so it can show you per-source-line or per-instruction data) is really great.

How much slower was the hash table when mapped across two physical memory domains vs on the same one?

I recall sometime back that Facebook specifically went with the Xeon-D processor for exactly this reason. Since Xeon D is single socket, it prevents NUMA type of issues.

It was about 50% slower. I would have expected more- 50% of the pages were on the opposite processor, and I expect the cost of a cross-processor communication (on a busy server) would be more than 2X a local memory lookup.
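The arithmetic behind that expectation, as a sketch: if run time is dominated by memory lookups, then slowdown = (1 - remote fraction) × 1 + remote fraction × remote cost, which can be solved for the remote cost. The function name is made up for this example, and the "dominated by lookups" premise is an assumption.

```python
def implied_remote_cost(slowdown, remote_fraction):
    """Remote access cost as a multiple of local cost.

    Assumes run time is dominated by memory lookups, so
    slowdown = (1 - remote_fraction) * 1 + remote_fraction * remote_cost.
    """
    return (slowdown - (1 - remote_fraction)) / remote_fraction


print(implied_remote_cost(slowdown=1.5, remote_fraction=0.5))
```

With half the pages remote and a 1.5x slowdown, the implied remote cost is 2x local. If lookups were only part of the run time, the same slowdown would imply a higher per-access ratio.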

We didn't even know there was a performance problem there- we just wanted to make the program faster, and ran perf, visualizing IPC and sorted by the routines with the lowest IPC (actually, we called it by the reciprocal, CPI, which I find a bit more intuitive). It sort of just provided a bright, blinking sign pointing right at the location causing a huge problem. Once that was solved, you could just select the next item on the list as the next thing to optimize :)
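The workflow described here amounts to computing CPI per routine and sorting worst-first. A minimal sketch follows; the function names and counts are invented, and in practice the numbers would come from a profiler such as perf rather than a hand-written dict.

```python
def rank_by_cpi(samples):
    """samples: {function name: (cycles, instructions)}.

    Returns (name, CPI) pairs sorted with the highest CPI (lowest IPC) first.
    """
    ranked = [(name, cycles / instructions)
              for name, (cycles, instructions) in samples.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Invented per-function counter totals.
profile = {
    "hash_lookup": (9_000_000, 1_000_000),
    "parse_input": (2_000_000, 2_500_000),
    "checksum": (1_000_000, 3_000_000),
}
for name, cpi in rank_by_cpi(profile):
    print(f"{name:12s} CPI={cpi:.2f}")
```

The routine at the top of the list is the "bright, blinking sign"; once it is fixed, the next entry becomes the next candidate.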

The Facebook server design is discussed here: https://code.facebook.com/posts/1711485769063510/facebook-s-...

I use `htop` for all of my Linux machines. It's great software. But one of my biggest gripes is that "Detailed CPU Time" (F2 -> Display options -> Detailed CPU time) is not enabled by default.

Enabling it allows you to see a clearer picture of not just stalls but also CPU steal from "noisy neighbors" -- guests also assigned to the same host.

I've seen CPU steal cause kernel warnings of "soft-lockups". I've also seen zombie processes occur. I suspect they're related but it's only anecdotal: I'm not sure how to investigate.

It's pretty amazing what kind of patterns you can identify when you've got stuff like that running. Machine seems to be non-responsive? Open up htop, see lots of grey... okay so since all data is on the network, that means that it's a data bottleneck; over the network means it could be bottlenecked at network bandwidth or the back-end SAN could be bottlenecked.

Fun fact: Windows Server doesn't like its disk IO not being serviced for minutes at a time. That's not a fun way to have another team come over and get angry because you're bluescreening their production boxes.

Screenshot? My version of htop does not show stalls. The only top I've seen that uses PMCs is tiptop.

htop should add PMC support, and a lot more things too. I might even start using it then. :)

It's steal, not stall - time "stolen" by the hypervisor, i.e. when you have runnable threads but the CPU is busy servicing other VMs. This can lead to stalling of your software, I guess, but it has nothing to do with instruction stalls.

htop, at least in the version I'm using, only shows steal if it's non-zero.

Steal time is available from mpstat. It's a machine-scoped metric so it should not belong to top which is process-scoped.

`top`, as well as `htop`, report some global stats like total memory utilization, swap usage, load average and CPU usage. So that’s totally fine.

Perf is fascinating to dive into. If you are using C and gcc you can use perf record/report, which show you line by line and instruction by instruction where you are getting slowdowns.

One of my favorite school assignments was one where we were given an intentionally bad implementation of the Game of Life compiled with -O3 and had to get it to run faster without changing compiler flags. It's sort of mind boggling how fast computers can do stuff if you can reduce the problem to fixed stride for loops over arrays that can be fully pipelined.

If you think perf is impressive you should try Intel vTune. Has support for finding hot mutexes, all perf counters, and even generates neat graphs that show diagrams of synchronization points between multiple threads.

Hi Hendzen, I found in one of your previous comments that you have set up your emacs to have full support for c++ projects, like auto-completion and reference following and similar. I have been using Emacs for a few years now, for js/python/html/C/C++/anything else and have just started working on a larger C++ project where intelligent auto completion and navigation would help a lot. Would you mind providing some more details on how you got it set up, maybe share your init.el file? It is a Cmake project, uses ninja as a generator. Thanks!

I'm writing less C++ than I used to recently, but here's the basic setup I use:

1) Emacs prelude w/ helm enabled everywhere - https://github.com/bbatsov/prelude

2) Irony-mode for auto completion. Since you're using Cmake, you can just add CMAKE_EXPORT_COMPILE_COMMANDS to your invocation to output the needed metadata for completion to work. - https://github.com/Sarcasm/irony-mode

3) helm-flycheck for syntax checking https://github.com/yasuyk/helm-flycheck

4) For jump to definition I generally just use helm-ag (grep). I never actually got this working to my satisfaction despite trying a few things like rtags (slow, unstable) and clang-ctags (promising but annoying to set up at the time).

5) For formatting I usually just enable clang-format on save w/ something like this (requires clang-format mode installed):

  (add-hook 'c++-mode-hook
    (lambda () (add-hook 'before-save-hook #'clang-format-buffer nil t)))

Are there affordable options for non-enterprise customers? I use it at work, but we buy bulk licenses and I was under the impression that licenses start at several thousands per seat.

Second that. It is amazing for C and C++ code - but does not work well on VMs such as AWS or with Java code.

... I believe I was the first person to run vTune from an AWS EC2 guest. I'll share details when I can. :)

That's true and a nice exercise for a programming class, but if you "reduce the problem to fixed stride for loops over arrays that can be fully pipelined." to make the game of life as fast as possible, you're doing it wrong (for those who don't know about it: https://en.wikipedia.org/wiki/Hashlife)

It's also true and nice for most modern game engines. I'm quite sure Game of life was just a subject for the exercise with the purpose to show that you can get major speedups without changing the algorithm - you don't always have a better one as backup.

Yeah for sure. We only had to calculate 1 or 2 iterations of the game and had a random seeded array as input and preallocated array we had to populate as output so there was a lot of optimization that wasn't worth it if you were actually looking to implement Game of Life, it was designed specifically to get you to think about computer system stuff.

Hashlife is impressive, but don't misread the above comment as saying it is "the" right way of doing it. Hashlife is good if there is lots of repetition (or sparseness).

There are many other optimisation strategies, with different trade-offs.

We are what we measure.

Very true that 100% CPU Utilization is often waiting on bus traffic (loading caches, loading ram, loading instructions, decoding instructions); only rarely is the CPU _doing_ useful work.

Whether this is useful work or not depends on the context of what you are measuring. The initial access of a buffer almost universally stalls (unless you prefetched 100+ instructions ago). But starting to stream this data into L1 is useful work.

Aiming for 100%+ IPC is _beyond_ difficult even for simple algorithms and critical hot path functions. You not only require assembler cooperation (to assure decoder alignment), but you need to know _what_ processor you are running on to know the constraints of its decoder, uOP cache, and uOP cache alignment.


Perf gives you ability to cache per PID counters. Generally just look at Cycles Passed vs Instructions decoded.

This gives you a general overview of stalls. Once you dig into IPC, front end stalls, back end stalls. You start to see the turtles.

At Tera, we were able to issue 1 instruction/cycle/CPU. The hardware could measure the number of missed opportunities (we called them phantoms) over a period of time, so we could report percent utilization accurately. Indeed, we could graph it over time and map periods of high/low utilization back to points in the code (typically parallel/serial loops), with notes about what the compiler thought was going on. It was a pretty useful arrangement.

Your CPU will execute a program just as fast at 5% as at 75%.

We honestly need a tool that compares I/O, memory fetch, cache-miss, TLB misses, page-outs, CPU Usage, interrupts, context-swaps, etc all in one place.

There's also loadavg. I've encountered a lot of people who think that a high loadavg MUST imply a lot of CPU use. Not on Linux, at least:

> The first three fields in this file are load average figures giving the number of jobs in the run queue (state R) or waiting for disk I/O (state D) averaged over 1, 5, and 15 minutes.

Nobody knows about the "or waiting for disk I/O (state D)" bit. So a bunch of processes doing disk I/O can cause loadavg spikes, but there can still be plenty of spare CPU.
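A tiny sketch of why the two can diverge: count the tasks that contribute to load (state R or D) separately from the tasks that actually want a CPU (state R). The snapshot is invented, and the real loadavg is a damped average over time rather than an instantaneous count.

```python
def load_contribution(task_states):
    """Tasks counted toward load on Linux: runnable (R) or uninterruptible (D)."""
    return sum(1 for state in task_states if state in ("R", "D"))


def on_cpu_demand(task_states):
    """Tasks that are actually runnable, i.e. competing for CPU time."""
    return sum(1 for state in task_states if state == "R")


# Invented snapshot: eight tasks blocked on disk, one runnable, twenty sleeping.
snapshot = ["D"] * 8 + ["R"] + ["S"] * 20
print(load_contribution(snapshot), on_cpu_demand(snapshot))
```

Here nine tasks count toward load while only one is asking for CPU, so a high loadavg can coexist with mostly idle processors.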

It seems to me that the CPU utilization metric (from /proc/stat) has far more problems than misreporting memory stalls.

As far as I understand it, the metric works as follows: At every clock interrupt (every 4ms on my machine) the system checks which process is currently running, before invoking the scheduler:
- If it is the idle process, idle time is accounted.
- Otherwise the processor is regarded as utilized.

(This is what I got from reading the docs, and digging into the source code. I am not 100% confident I understand this completely at this point. If you know better please tell me!)

There are many problems with this approach: Every time slice (4ms) is accounted either as completely utilized or completely free. There are many reasons for processes going on CPU or off CPU outside of clock interrupts. Blocking syscalls are the most obvious one. In the end a time slice might be utilized by multiple different processes and interrupt handlers but if at the very end of the time slice the idle thread is scheduled on CPU the whole slice is counted as idle time!
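The worst case of the tick-based model described in this comment can be simulated in a few lines. This is a sketch of the model as the commenter states it (the commenter is not fully sure of it either), not of any particular kernel's accounting code; times are in microseconds and the workload is invented to dodge every tick.

```python
def sampled_utilization(busy_intervals, tick, duration):
    """Tick-based accounting: each whole tick is charged busy or idle
    depending on what is running at the instant the tick fires."""
    def busy_at(t):
        return any(start <= t < end for start, end in busy_intervals)

    ticks = duration // tick
    busy_ticks = sum(1 for i in range(1, ticks + 1) if busy_at(i * tick))
    return busy_ticks / ticks


def true_utilization(busy_intervals, duration):
    """Fraction of the period the CPU was actually running the workload."""
    return sum(end - start for start, end in busy_intervals) / duration


TICK_US = 4000        # 4 ms tick, as in the comment
DURATION_US = 40000   # ten ticks

# The task runs for 3.9 ms inside every tick but is always asleep when it fires.
bursts = [(k * TICK_US + 50, k * TICK_US + 3950) for k in range(10)]

actual = true_utilization(bursts, DURATION_US)
reported = sampled_utilization(bursts, TICK_US, DURATION_US)
print(actual, reported)
```

In this contrived schedule the CPU is busy 97.5% of the time and the sampled figure is 0%. Real workloads rarely line up this badly, but the error has the same source.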

See also: https://github.com/torvalds/linux/blob/master/Documentation/...

The article is interesting, but IPC is the wrong metric to focus on. Frankly, the only thing we should care about when it comes to performance is time to finish a task. It doesn't matter if it takes more instructions to compute something, as long as it's done faster.

The other metric you can mix with execution time is energy efficiency. That's about it. IPC is not a very good proxy. Fun to look at, but likely to be highly misleading.

The idea here is that it appears that you are limited by the CPU and want to make things faster.

When doing optimized code, the question "should I optimize for memory or for computing?" comes up often. Should I cache results? Should I use a more complex data structure in order to save memory or improve locality?

IPC is a good indicator on how you should tackle the problem. High IPC means you may be doing too many calculations, while low IPC means that you should look at your memory usage. BTW, most of the time, memory is the problem.

Add maximum latency to that for interactive or realtime use. That tends to depend on memory access patterns.

Instructions per cycle: https://en.wikipedia.org/wiki/Instructions_per_cycle

What does IPC tell me about where my code could/should be async so that it's not stalled waiting for IO? Is combined IO rate a useful metric for this?

There's an interesting "Cost per GFLOPs" table here: https://en.wikipedia.org/wiki/FLOPS

Btw these are great, thanks: http://www.brendangregg.com/linuxperf.html

( I still couldn't fill this out if I tried: http://www.brendangregg.com/blog/2014-08-23/linux-perf-tools... )

It's not that your code should be async, but that it should be more cache-friendly. It's probably mostly stalled waiting for RAM.

Oh, is this because of context switching for resource sharing?

Another related tool I found interesting: perf c2c

This will let us find the false sharing cost (cache contention etc).


By clicking through some links on the article I found this: http://www.brendangregg.com/blog/2014-10-31/cpi-flame-graphs...

Now I wonder how easy it would be, and how much manual work, to do these combined flamegraphs with CPI/IPC information? My cursory search found nary a mention after 2015... Perhaps this is still hard and complicated.

To me it seems really useful to know why a function takes so long to work (waiting or calculating) and not "merely" how long it takes. Even if the information is not perfectly reliable and can't be measured without some effect on execution.

Interestingly IPCs are also used to verify new chipsets in embedded companies. Run the same code with newer generation chipset and see if IPC is better than the previous. IPCs are one of the main criteria if the new chipset is a hit or miss (others are power..)

Interesting. That must get messed up whenever they change the architecture in a major way, eg, Intel Skylake going from 4-wide to 5-wide.

AFAIK this is used in embedded chipsets that come with preloaded software like modems.

I can't see a mention of it here, or on the original page, so IMO it's worth pointing out a utility that you will most likely already have installed on your Linux machine: vmstat. Just run:

   vmstat 3
And you'll get a running breakdown of CPU usage (split into user/system), and a breakdown of 'idle' time (split into actual idle time and time waiting for I/O or some kinds of locks).

The '3' in the command line is just how long the stats are averaged over, I'd recommend using 3+ to average out bursts of activity on a fairly steady-state system.

I didn't know about tiptop, and it sounds interesting. Running it, though, it only shows "?" in Ncycle, Minstr, IPC, %MISS, %BMIS and %BUS columns for a lot of processes, including, but not limited to, Firefox.

On the cloud? On AWS EC2? The PMCs required for tiptop to work were only enabled last week, for dedicated full-host systems: http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2....

On my laptop. And I do get values for some processes. But not all of them.

CPU util might be misleading, but cpu idle under a threshold at peak [1] means you need more idle cpu and you can get that by getting more machines, getting better machines, or getting better code.

Only when I'm trying to get better code, do I need to care about IPC, and cache stalls. I may also want better code to improve the overall speed of execution, too.

[1] (~50% if you have a pair of redundant machines and load scales nicely, maybe 20% idle or even less if you have a large number of redundant machines and the load balances easily over them)
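The footnote's numbers follow from a simple headroom calculation: with N machines sharing load evenly, surviving the loss of f of them requires each to keep f/N of its capacity idle at peak. A minimal sketch, assuming perfectly even load balancing (the function name is made up for this example):

```python
def required_idle_fraction(machines, tolerated_failures=1):
    """Idle share each machine needs at peak so the survivors can absorb
    the load of the failed machines, assuming load spreads evenly."""
    if not 0 <= tolerated_failures < machines:
        raise ValueError("tolerated_failures must be in [0, machines)")
    # Per-machine load u must satisfy machines * u / (machines - failures) <= 1.
    return tolerated_failures / machines


for n in (2, 5, 20):
    print(n, required_idle_fraction(n))
```

A redundant pair needs 50% idle, five machines need 20%, and twenty machines need only 5%, which lines up with the figures in the footnote.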

The server seems overloaded (somewhat ironically). Try http://archive.is/stDR0 .

CPU frequency scaling can also lead to somewhat unintuitive results. On a few occasions I've seen CPU load % increasing significantly after code was optimized. Optimization was still actually valid, and the actual executed instructions per work item went down, but the CPU load % went up since OS decided to clock down the CPU due to reduced workload.
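A worked example of that effect, treating load % as cycles needed per second divided by cycles available per second. All figures (cycles per item, request rate, clock speeds) are invented for illustration.

```python
def cpu_load_percent(cycles_per_item, items_per_second, clock_hz):
    """Busy share of one core: cycles demanded per second over cycles supplied."""
    return 100.0 * cycles_per_item * items_per_second / clock_hz


# Before: 2M cycles per item at 3.0 GHz.
before = cpu_load_percent(cycles_per_item=2_000_000, items_per_second=600, clock_hz=3.0e9)

# After: the optimized code needs 40% fewer cycles per item,
# but the governor has dropped the clock to 1.2 GHz.
after = cpu_load_percent(cycles_per_item=1_200_000, items_per_second=600, clock_hz=1.2e9)

print(before, after)
```

With these numbers the load rises from about 40% to about 60% even though each item costs less work, because the denominator shrank faster than the numerator.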

> You can figure out what %CPU really means by using additional metrics, including instructions per cycle (IPC)

Correct me if I am wrong, but this won't work for spinlocks in busy loops: you do have a lot of instructions being executed, but the whole point of the loop is to wait for the cache to synchronize, and as such, this should be taken as "stalled".

Which is mentioned in the article (which few people fully read of course)!

See under "Other reasons CPU Utilization is misleading"

> Spin locks: the CPU is utilized, and has high IPC, but the app is not making logical forward progress.

I think thinking about the CPU as mainly the ALU seems myopic. The job of the CPU is to get data into the right pipeline at the right time. Waiting for a cache miss means it's busy doing its job. Thus, CPU busy is a reasonable metric the way it is currently defined and measured. (After all, the memory controller is part of the CPU these days.)

It also is reasonable to know about cache misses so that you can do something about that if you decide that's possible. Do you think because it is not always valuable information, maybe even rarely (when you average over all programmers worldwide, most of whom do web stuff) that it never is, for anyone ever?

This article is not as silly as it could be.

Let me help.

Look, CPU utilization is misleading. Did you forget to use -O2 when compiling your code? Oops, CPU utilization is now including all sorts of wasteful instructions that don't make forward progress, including pointless moves of dead data into registers.

Are you using Python or Perl? CPU utilization is misleading; it's counting all that time spent on bookkeeping code in the interpreter, not actually performing your logic.

CPU utilization also measures all that waste when nothing is happening, when arguments are being prepared for a library function. Your program has already stalled, but the library function hasn't started executing yet for the silly reason that the arguments aren't ready because the CPU is fumbling around with them.

Boy, what a useless measure.

> Boy, what a useless measure.

The article makes the perfectly-reasonable observation that "CPU usage" also includes the time spent waiting on memory, that this time (as a %) has increased significantly in modern processors, and that visibility into stall time is potentially actionable to increase efficiency. It's useful to think about the continued relevance of performance metrics which were introduced in the 1970s and have continued basically unchanged into 2017. The article does not claim that "CPU usage", as a metric, is useless. Your comment is satirising a strawman.

It isn't called useless, just "misleading" and "wrong". How useful is something if it is both?

If I have two programs which scan through a 100 Gb file to do the same processing, and both take the same amount of time (the task is I/O bound), which is more efficient?

Let's see: one uses 70% CPU, the other 5%: clear.

How is that wrong; where are we misled?

This is simply telling us that although the elapsed wall time is the same, one program is taking longer. We could look at the processor time instead of the percentage: that's just a rearrangement of the same figures. That 70% is just the 0.7 factor between the processor time and real time.
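That relationship between processor time and real time can be checked directly. The sketch below measures both for a callable using Python's standard `time` module; the `sleep` call merely stands in for an I/O-bound wait, and the helper names are made up for this example.

```python
import time


def cpu_fraction(cpu_seconds, wall_seconds):
    """The utilization figure is just processor time divided by real time."""
    return cpu_seconds / wall_seconds


def measure(fn):
    """Return (cpu_seconds, wall_seconds) consumed by calling fn()."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn()
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    return cpu, wall


# Sleeping stands in for waiting on I/O: wall time passes, CPU time barely moves.
cpu, wall = measure(lambda: time.sleep(0.2))
print(f"{100 * cpu_fraction(cpu, wall):.0f}% CPU over {wall:.2f}s")
```

A program that burned 7 seconds of processor time over a 10 second run would report 70% by the same division.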

It is the article that is fighting a strawman because nobody uses raw CPU time or utilization to measure how well a program execution is using the internal resources of the processor.

I have never seen anyone claim, based only on processor time alone and nothing else, that a program which takes longer to do the same thing as another one is due to a specific cause, like caches being cold, or a poorer algorithm. That person would be misled and wrong, not CPU time per se.

brendangregg is clearly trying to educate people who may not be aware of the difference between "CPU is busy" and "CPU is doing useful work." You already know this, and that's okay! You're not the target audience. I concede his title was slightly on the side of "I'm going to say something simple and catchy, then walk it back a bit with a nuanced explanation", but I'm willing to give him slack considering his numerous (http://www.brendangregg.com/portfolio.html) contributions to the literature for analyzing systems performance.

The core waiting for data to be loaded from RAM is busy. Busy waiting for data.

Instructions per cycle can also be misleading. Modern cpu's can do multiple shifts per cycle, but something like division takes a long time.

It all doesn't matter anyway, as instructions per cycle does not tell you anything specific. Use the cpu-builtin performance counters, use perf. It basically works by sampling every once in a while. It (perf, or any other tool that uses performance counters) shows you exactly what instructions are taking up your process's time. (hint: it's usually the ones that read data from memory; so be nice to your caches)

It's not rocket surgery.

> The core waiting for data to be loaded from RAM is busy. Busy waiting for data.

A) it can run another hyperthread with those cycles, and,

B) even without hyperthreads, it misleads the developers as to where to tune.

C) also, "busy waiting" sounds like an oxymoron. DRAM is busy. The CPU is waiting. Yes, the OS doesn't differentiate and calls it %CPU.

> It all doesn't matter anyway, as instructions per cycle does not tell you anything specific.

It provides a clue as to whether %CPU is instruction bound or stall bound.

> Use the cpu-builtin performance counters, use perf. It basically works by sampling every once in a while.

That's called sampling mode. The example in my post used counting mode.

Edit: added (C).

I think A is relatively uncommon, as long as mainstream CPUs only have 2 threads. It may give you a boost when you happen to have between N and N*2 runnable threads, for N = number of cores. And those idle CPU threads were already showing up as idle logical processors in top so you would have had a good reason to try to employ them earlier already.

(terminology nitpick: HyperThreading is just Intel's proprietary trademark for SMT)

>A) it can run another hyperthread with those cycles, and,

I don't see hyperthreading as useful. Neither do the benchmarks, which usually put hyperthreading off ahead of hyperthreading on.

Reasoning is simple. When a cache miss happens the (intel) cpu runs another thread. The question is does that happen often enough to justify the overhead of that "hyper"threading. Remember that hyperthreading does not magically make a new cpu core, so there has to be overhead (and the benchmarks prove so). Even if there is no overhead, there is still the other thread filling the L1 (and L2) cache (programs that don't often read from nor write to memory are very rare). Apparently Agner Fog wrote a post about it [0].

Even without taking into account what i wrote above, hyperthreading would not be a big enough influence on the sampling (i can draw a hypothetical timeline, if you don't get it).

>B) even without hyperthreads, it misleads the developers as to where to tune.

No, no it doesn't. How would it? It shows you exactly what instruction is taking up time. If it is a MOV then your data is not in cache, if it is a divss then.. well you have a division in your code, if it is about evenly spread in a pattern then you should look if your code is serial (not that there is usually much help for that), if it is a branch then.. etc. How the F is that less helpful than a blanket "your code is not executing many instructions per tick" statement? Instructions per tick only tells you if another process is f-ing with yours (probably by hogging memory bandwidth).

>C) also, "busy waiting" sounds like an oxymoron. DRAM is busy. The CPU is waiting. Yes, the OS doesn't differentiate and calls it %CPU.

Normally it is, but we are not writing poems here. Remember how old CPUs had a busy loop that ran when idling? AFAIK even today's CPUs have it.

>It provides a clue as to whether %CPU is instruction bound or stall bound.

If I read a byte from a random place in a gigabyte array, then do some shifts on it and store it back, it will still show a (relatively) high number of instructions but the bottleneck will be memory. On the other hand, if I load a float from a small array and then do some math on it (notably division), it will result in a low number of instructions per tick but the memory will not be the bottleneck.

So all in all, instructions per tick doesn't tell you much. Maybe it helps when a program suddenly starts to go slow, to more easily check what other program is hogging memory access. Or as a general rule of thumb that makes you think "this code should execute instructions much faster, hmm".

As for finding bottlenecks and dealing with them:

If your MOV instruction is a bottleneck, you are either dealing with a huge amount of data that can't be accessed in a predictable manner or you didn't think to make your code/data structures friendly to cache[1] or your code is really complicated or just shit (as 90% of code is).

Perf also shows you cache misses...

Another great thing about perf (actually about objdump) is that it can interweave code and assembly when showing you where your code is slow. So you don't even have to learn assembly.

[0] http://www.agner.org/optimize/blog/read.php?i=6

[1] I have a huge 2D array representing a grid. The grid will be sampled 1-5 points up, down, left, and right. The up and down sampling will, more often than not, result in cache misses (for obvious reasons). Solution? Cut the grid into squares.
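
A minimal sketch of the "cut the grid into squares" idea in [1], assuming a row-major grid whose width is a multiple of a hypothetical tile size `TILE`. The function names and the tile size are illustrative and not from the comment; the point is only that cells of one square end up contiguous in memory, so a step up or down inside a square lands `TILE` elements away instead of a whole row away.

```python
TILE = 8  # hypothetical tile edge length; would be tuned to element and cache line size

def row_major_index(x, y, width):
    """Offset of cell (x, y) in a plain row-major layout."""
    return y * width + x

def tiled_index(x, y, width, tile=TILE):
    """Offset of cell (x, y) when the grid is stored square by square.

    Assumes width is a multiple of tile. All cells of one square are
    contiguous, so a vertical neighbour inside the same square is only
    `tile` elements away.
    """
    tiles_per_row = width // tile
    tile_x, in_x = divmod(x, tile)   # which square, and column inside it
    tile_y, in_y = divmod(y, tile)   # which square, and row inside it
    tile_number = tile_y * tiles_per_row + tile_x
    return tile_number * tile * tile + in_y * tile + in_x
```

With a width of 64, moving from (3, 4) to (3, 5) is a stride of 64 elements in the row-major layout but 8 in the tiled one. Steps that cross a square boundary are still far apart, which is why the square size matters.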

PS Hyperthreading sucks. I always turn it off.

edit: PPS No benchmarks, hypotheticals, deeper research into the causes?

This is silly. The conceit that ipc is a simplification for "higher is better" is exactly the problem he has with utilization.

True, but useful? Most of us are busy trying to get writes across a networked service. Indeed, getting to 50% utilization is often a dangerous place.

For reference, running your car by focusing on rpm of the engine is silly. But, it is a very good proxy and even more silly to try and avoid it. Only if you are seriously instrumented is this a valid path. And getting that instrumented is not cheap or free.

This is a gross misrepresentation of the article. It is in your own head that he solely focuses on this one thing. When somebody says "this soup needs more salt" you start talking about how there always is too much salt in processed food these days? The article is about a concrete narrow subject, not about "performance" (the entire field).

Can't some people read an article without extrapolating to the end of the known universe and just stick to the article's actual narrow subject?

Take my criticism as being on the chosen title and framing.

It is a neat subject. Just as knowing different ways of measuring car utilization and power transfer is. Framing it as a takedown of how people commonly do it is extreme and at major risk of throwing out the baby with the bathwater.

Is there any easy way to do profiling that reveals a stalled CPU because of pointer chasing, for "high level devs" on Windows?

Any way to do something equivalent on OSX?

This was very enlightening. I have the highest respect for Brendan and his insights, I must say.

or your code could be riddled with thread contentions

I guess this is why he used the term likely

Interesting article though

Using IPC as a proxy for utilization is tricky because an out-of-order machine can only get that max IPC if the instructions it is executing are not dependent on not-yet-computed instructions.

In-order CPUs are much easier to reason about; you can literally count the stalled cycles.
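
To illustrate the dependency point, here is a toy scheduling model, not a description of any real microarchitecture. It assumes every instruction has a one-cycle latency, the machine can issue up to `width` instructions per cycle, and an instruction can only issue in a cycle after all of its producers have issued. The function name `toy_ipc` and the `deps` encoding are made up for this sketch.

```python
def toy_ipc(deps, width):
    """Return instructions per cycle for a non-empty instruction list.

    deps[i] lists the indices of earlier instructions that
    instruction i depends on. Instructions are placed greedily in
    program order into the earliest cycle that has a free issue slot.
    """
    issue_cycle = []       # cycle in which each instruction issues
    used_slots = {}        # cycle -> number of instructions issued in it
    for producers in deps:
        # Earliest legal cycle: one after the latest producer, or cycle 0.
        cycle = 1 + max((issue_cycle[p] for p in producers), default=-1)
        while used_slots.get(cycle, 0) >= width:
            cycle += 1
        used_slots[cycle] = used_slots.get(cycle, 0) + 1
        issue_cycle.append(cycle)
    total_cycles = max(issue_cycle) + 1
    return len(deps) / total_cycles
```

In this model, eight independent instructions on a 4-wide machine reach an IPC of 4, while eight instructions that each depend on the previous one reach only 1, even though nothing is stalled on memory.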

Need a new metric "CPU efficiency".

Totally disagree with the premise of the article. Every metric tool that I know of that shows CPU utilization correctly shows CPU work. Load, on the other hand, represents CPU and iowait (overall system throughput). I/O wait is also exposed in top as the "wait" metric. An Amazon EC2 box can very easily get to load(5) = 10 (anything above 1 is considered bad), but the CPU utilization metric will still show almost no CPU utilization.

Sorry, I'm not talking about iowait at all. I think you're thinking about load averages, which I'm not talking about in that article. I'm actually talking about memory stalls.

> If your IPC is < 1.0, you are likely memory stalled,

depends on the workload.

It's always nice to see when people take even less than a complete sentence and pretend that this is what the author said and that this was all.

Better yet if the response is even shorter than the excerpt quoted ;)

> For my above rules, I split on an IPC of 1.0. Where did I get that from? I made it up, based on my prior work with PMCs. Here's how you can get a value that's custom for your system and runtime: write two dummy workloads, one that is CPU bound, and one memory bound. Measure their IPC, then calculate their mid point.
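
The calibration described in that quote can be sketched as below. The function names are hypothetical, and the two dummy-workload IPC values in the usage note are made-up placeholders, not measurements; in practice the instruction and cycle counts would come from PMCs (for example via perf in counting mode).

```python
def ipc(instructions, cycles):
    """Instructions per cycle from two raw counter values."""
    return instructions / cycles

def midpoint_threshold(cpu_bound_ipc, memory_bound_ipc):
    """Split point halfway between the two dummy workloads' IPC."""
    return (cpu_bound_ipc + memory_bound_ipc) / 2.0

def classify(measured_ipc, threshold):
    """Rough hint only: which side of the split a workload falls on."""
    if measured_ipc >= threshold:
        return "likely instruction bound"
    return "likely stall bound"
```

For instance, if the CPU-bound dummy measured 2.0 and the memory-bound dummy 0.5 (placeholder numbers), the split would be 1.25, and a workload with an IPC of 0.11 would be classified as likely stall bound.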

Well, this is the reason I hate HyperThreading: does your app consume 50% or 100%? With hyperthreading you have no clue.

And that is per core, it becomes increasingly meaningless on a dualcore and on a quadcore and above you might as well replace it with MS Clippy.

And this is before discussing what that percentage really means.

edit: I'm interpreting the downvotes as meaning that people are in denial about this ;)

Different task monitors make this clear in their docs and they tend to stick with it. Read up on the tools you use once and this source of mystery will be gone forever.

As far as hyperthreading goes, not understanding something is a bad reason to dislike it. For most developer purposes, pretend it is another CPU until you learn enough about profiling multithreaded apps that you need to care about the difference.

My comment was from a user perspective. I don't mind the performance increase from HT (and that is why I still have it enabled).

But, it really bothers me that it gets much harder to assess both the total load and the load produced by a single process. I haven't found any useful info on how to deal with this in simple overview user-level tools such as top or the windows task manager.

What are the disadvantages?

CPU time accounting measurements (sys/usr/idle) as reported by standard tools do not reflect the side-effects of resource sharing between hardware threads

It is impossible to correctly measure idle and extrapolate available computing resources

