
CPU Utilization is Wrong - dmit
http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html
======
faragon
I respect Brendan, and although it is an interesting article, I have to
disagree with him: the OS tells you about OS CPU utilization, not CPU micro-
architecture functional-unit utilization. So if the OS uses a CPU for running
code until a physical interrupt or a software trap happens, in that period the
CPU has been doing work. Unless the CPU were able to do a "free" context
switch to a cached area without having to wait for e.g. a cache miss (hint:
SMT/"hyperthreading" was invented exactly for that use case), the CPU would
actually be busy.

If in the future (TM) using CPU performance counters for every process becomes
really "free" (as in "gratis" or "cheap"), the OS could report badly performing
processes for the reasons given in the article (low IPC indicating poor
memory access patterns, unoptimized code, code using too-small buffers for I/O,
which degrades system performance through excessive kernel processing time,
etc.), showing the user that despite having high CPU usage, the
CPU is not getting enough work done (in that sense I could agree with the
article).

~~~
brendangregg
Yes, I understand this argument. %CPU, as we know it, tells you OS CPU
utilization (what I called non-idle time), and you can look at the kernel code
and how it is calculated and conclude that it's not wrong. It is what it is.
It's non-idle time.

Your point that we can't utilize those cycles, so therefore they are
utilized... Well, I have a production instance running with an IPC of 0.11
(yes, seriously), and system-wide %CPU of 60, and idle hyperthreads. If I
added a high IPC workload to those idle hyperthreads, do you think I could
recover those stalled cycles? Probably. So were those cycles utilized or not?
:-)

As for using CPU counters for free: sure, they basically are. Just
instructions to read & configure (WRMSR, RDMSR, RDPMC), and the OS can do it
on kernel context switch, so that they can be associated per-process. It'd
cost very very little: adding a few instructions to each context switch, and
some members to task_struct. I'd be a bit annoyed if the kernel started doing
this everywhere as it'd tie up some PMCs (physical resource), so I'd rather
that behavior be configurable.
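
The per-process accounting described above boils down to simple arithmetic once the counts are captured. A minimal sketch of the article's rule of thumb (the 1.0 threshold is the article's heuristic, not a hardware constant, and the function name is mine):

```python
def classify_ipc(instructions, cycles):
    """Classify a workload by instructions per cycle, using the
    article's rough heuristic: low IPC suggests memory stalls,
    high IPC suggests the CPU is genuinely instruction bound."""
    ipc = instructions / cycles
    if ipc < 1.0:
        return ipc, "likely stall bound (check memory access patterns)"
    return ipc, "likely instruction bound (look for cheaper code paths)"

# The production instance mentioned above: IPC of 0.11.
ipc, verdict = classify_ipc(instructions=11, cycles=100)
print(f"IPC = {ipc:.2f}: {verdict}")
```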

~~~
jsmthrowaway
It'd cost a little more than that, wouldn't it? I thought once the counters
were configured there was a (potentially tiny?) overhead to any type of
forward progress of RIP. You know way more about this than me, so glad to
learn something if I'm wrong, and if you say it barely exists I'll be using
_far more_ of perf and other things like it... Maybe it's negligible enough
that the data is worth the trade, anyway.

Great and timely article, by the way, though it disappointingly lacks a video
of you yelling at CPUs.

~~~
nkurz
_if you say it barely exists I'll be using far more of perf and other things
like it..._

I'll back Brendan and claim that using hardware CPU counters on modern server
processors is basically free. They are internal to the processor, and
essentially don't utilize any resources shared with program execution. Maybe
you could contrive a case where they cause faster thermal throttling, but I
haven't seen it. For a simple count of events that is started and stopped at
context switches (and when you are counting fewer events than available
hardware counters) I'd be surprised if you even saw 1% overhead.

If you are trying to multiplex many more events than you have physical
counters, the overhead of "reprogramming" the counters might cost you single-
digit overhead. If you are trying to save all stack traces, record all
branches, or monitor all memory accesses you can manage to create a more
significant slowdown, but even then you can usually lower this to something
acceptable by sampling.

So yes --- you probably should be using perf far more! Or VTune, or likwid
([https://github.com/RRZE-HPC/likwid](https://github.com/RRZE-HPC/likwid)).
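
The multiplexing case above also shows up as estimation: when perf multiplexes events across scarce hardware counters, it scales each raw count by the ratio of time the event was enabled to time it was actually counting. A sketch of that scaling (the function name is mine; the enabled/running ratio is how perf reports scaled counts):

```python
def scale_multiplexed(raw_count, time_enabled_ns, time_running_ns):
    """Estimate a full-run event count from a multiplexed counter:
    the raw count is scaled up by enabled/running time, since the
    event shared a hardware counter with other events."""
    if time_running_ns == 0:
        return 0.0
    return raw_count * time_enabled_ns / time_running_ns

# An event that was counting for a quarter of the run is scaled up 4x.
print(scale_multiplexed(1_000_000, time_enabled_ns=4_000_000,
                        time_running_ns=1_000_000))
```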

------
glangdale
The problem is that IPC is also a crude metric. Even leaving aside fundamental
algorithmic differences, an implementation of some algorithm with IPC of 0.5
is not necessarily faster than an implementation that somehow manages to hit
every execution port and deliver an IPC of 4.

I can improve IPC of almost any algorithm (assuming it is not already very
high) by slipping lots of useless or nearly useless cheap integer operations
into the code.

People always tell you "branch misses are bad" and "cache misses are bad". You
should always ask: "compared to what"? If it was going to take you 20 cycles
worth of frenzied, 4 instructions per clock, work to calculate something you
could keep in a big table in L2 (assuming that you aren't contending for it)
you might be better off eating the cache miss.

Similarly you could "improve" your IPC by avoiding branch misses (assuming no
side effects) by calculating both sides of an unpredictable branch and using
CMOV. This will save you branch misses and increase your IPC, but it may not
improve the speed of your code (if the cost of the extra work is bigger than
the cost of the branch misses).
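
The CMOV trade-off can be sketched in a few lines. This Python is only an analogue (a real compiler emits CMOV for the select), but it shows the branchless version doing strictly more arithmetic per element for the same answer:

```python
def branchy(xs, threshold):
    # One side computed per element; an unpredictable branch in C.
    out = []
    for x in xs:
        if x < threshold:
            out.append(x * 3)
        else:
            out.append(x + 7)
    return out

def branchless(xs, threshold):
    # Both sides computed, then selected -- the CMOV pattern.
    # More instructions retired per element (so higher IPC on real
    # hardware), but not automatically faster.
    out = []
    for x in xs:
        a = x * 3
        b = x + 7
        cond = x < threshold
        out.append(a * cond + b * (not cond))
    return out

data = [1, 5, 9, 2, 8]
assert branchy(data, 5) == branchless(data, 5)
```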

~~~
faragon
A sustained IPC of 2 is already quite high (e.g. for an OoOE CPU with 2
load/store units and 2 [vector] ALUs, i.e. a maximum IPC of 4).

You can reach near-full CPU utilization when processing sequential data
(read/process/store). When using trees, hashes, etc. on "big enough data", it
is very hard to get good IPC (even with 99% L1 cache hits, e.g. because of
code where the branch predictor doesn't have enough information, or when many
CPUs are accessing data in the same region of memory, etc.).

~~~
jwatte
You should say "ALU utilization" when that's what you mean.

~~~
faragon
Full CPU utilization regarding IPC is not just about the ALUs; it includes all
units (ALUs, load/store, and others, e.g. if you have separate integer and
floating-point ALUs, etc.). E.g. with a maximum-IPC-4 CPU (2 load/store, 2
ALU), a synthetic benchmark doing only optimal register-register ALU work
would get an IPC of 2.

Also, depending on what algorithm you're running, you'll never get the maximum
IPC, even with optimal code. E.g. a hash digest doesn't need the store unit
very much, so on the CPU above you might reach an IPC of 3 with optimal code.
So not reaching the maximum IPC does not always mean you're under-using the
CPU for a given algorithm.

------
dekhn
IPC is amazing. We had some "slow" code, did a little profiling, and found
that a hash lookup function was showing very low IPC about half the time.
Turns out, the hash table was mapped across two memory domains on the server
(NUMA), and a memory lookup from one processor to the other processor's memory
was significantly slower.

perf on a binary that is properly instrumented (so it can show you per-source-
line or per-instruction data) is really great.

~~~
alberth
How much slower was the hash table when mapped across two physical memory
domains vs on the same one?

I recall sometime back that Facebook specifically went with the Xeon-D
processor for exactly this reason. Since Xeon D is single socket, it prevents
NUMA type of issues.

~~~
dekhn
It was about 50% slower. I would have expected more: 50% of the pages were on
the opposite processor, and I expected the cost of a cross-processor memory
access (on a busy server) to be more than 2x a local memory lookup.

We didn't even know there was a performance problem there- we just wanted to
make the program faster, and ran perf, visualizing IPC and sorted by the
routines with the lowest IPC (actually, we called it by the reciprocal, CPI,
which I find a bit more intuitive). It sort of just provided a bright,
blinking sign pointing right at the location causing a huge problem. Once that
was solved, you could just select the next item on the list as the next thing
to optimize :)
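
The expected slowdown is easy to back out with a weighted average. With the illustrative assumption that a remote access costs 2x a local one and half the pages are remote, the model actually lands right on the observed ~50%:

```python
def expected_slowdown(remote_fraction, remote_cost_ratio):
    """Average memory-access cost relative to all-local, when a
    fraction of pages sit on the remote NUMA node. The cost ratio
    is an illustrative assumption, not a measurement."""
    return (1 - remote_fraction) * 1.0 + remote_fraction * remote_cost_ratio

# Half the pages remote at 2x local cost -> 1.5x, i.e. ~50% slower.
print(expected_slowdown(0.5, 2.0))
```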

------
inetknght
I use `htop` for all of my Linux machines. It's great software. But one of my
biggest gripes is that "Detailed CPU Time" (F2 -> Display options -> Detailed
CPU time) is not enabled by default.

Enabling it allows you to see a clearer picture of not just stalls but also
CPU steal from "noisy neighbors" -- other guests assigned to the same host.

I've seen CPU steal cause kernel warnings of "soft-lockups". I've also seen
zombie processes occur. I suspect they're related but it's only anecdotal: I'm
not sure how to investigate.

It's pretty amazing what kind of patterns you can identify when you've got
stuff like that running. Machine seems to be non-responsive? Open up htop, see
lots of grey... okay so since _all_ data is on the network, that means that
it's a data bottleneck; over the network means it could be bottlenecked at
network bandwidth or the back-end SAN could be bottlenecked.

Fun fact: Windows Server doesn't like having its disk IO go unserviced for
minutes at a time. It's not fun to have another team come over, angry because
you're bluescreening their production boxes.

~~~
brendangregg
Screenshot? My version of htop does not show stalls. The only top I've seen
that uses PMCs is tiptop.

htop should add PMC support, and a lot more things too. I might even start
using it then. :)

~~~
qb45
It's steal, not stall - time "stolen" by the hypervisor, i.e. when you have
runnable threads but the CPU is busy servicing other VMs. This can lead to
stalling of your software, I guess, but it has nothing to do with instruction
stalls.

------
nimos
Perf is fascinating to dive into. If you are using C and gcc you can use perf
record/report, which show you line by line and instruction by instruction
where you are getting slowdowns.

One of my favorite school assignments was one where we were given an
intentionally bad implementation of the Game of Life compiled with -O3 and
tried to get it to run faster without changing compiler flags. It's sort of
mind-boggling how fast computers can do stuff if you can reduce the problem to
fixed-stride for loops over arrays that can be fully pipelined.

~~~
hendzen
If you think perf is impressive you should try Intel VTune. It has support for
finding hot mutexes and all perf counters, and even generates neat graphs that
show diagrams of synchronization points between multiple threads.

~~~
Martinsos
Hi Hendzen, I found in one of your previous comments that you have set up your
emacs to have full support for c++ projects, like auto-completion and
reference following and similar. I have been using Emacs for a few years now,
for js/python/html/C/C++/anything else and have just started working on a
larger C++ project where intelligent auto completion and navigation would help
a lot. Would you mind providing some more details on how you got it set up,
maybe share your init.el file? It is a Cmake project, uses ninja as a
generator. Thanks!

~~~
hendzen
I'm writing less C++ than I used to recently, but here's the basic setup I
use:

1) Emacs prelude w/ helm enabled everywhere
-[https://github.com/bbatsov/prelude](https://github.com/bbatsov/prelude)

2) Irony-mode for auto completion. Since you're using CMake, you can just add
CMAKE_EXPORT_COMPILE_COMMANDS to your invocation to output the metadata needed
for completion to work. -
[https://github.com/Sarcasm/irony-mode](https://github.com/Sarcasm/irony-mode)

3) helm-flycheck for syntax checking -
[https://github.com/yasuyk/helm-flycheck](https://github.com/yasuyk/helm-flycheck)

4) For jump-to-definition I generally just use helm-ag (grep); I never
actually got this working to my satisfaction despite trying a few things like
rtags (slow, unstable) and clang-ctags (promising but annoying to set up at
the time).

5) For formatting I usually just enable clang-format on save w/ something like
this (requires clang-format mode installed):

    (add-hook 'c++-mode-hook
      (lambda () (add-hook 'before-save-hook #'clang-format-buffer nil t)))

------
valarauca1
We are what we measure.

Very true that 100% CPU utilization is often waiting on bus traffic (loading
caches, loading RAM, loading instructions, decoding instructions); only rarely
is the CPU _doing_ useful work.

Whether this counts as useful work depends on the context of what you are
measuring. The initial access of a buffer almost universally stalls (unless
you prefetched 100+ instructions ago), but starting to stream that data into
L1 is useful work.

Aiming for maximum IPC is _beyond_ difficult even for simple algorithms and
critical hot-path functions. You not only require assembler cooperation (to
ensure decoder alignment), but you need to know _what_ processor you are
running on to know the constraints of its decoder, uOP cache, and uOP cache
alignment.

\---

Perf gives you the ability to capture per-PID counters. Generally just look at
cycles passed vs instructions retired.

This gives you a general overview of stalls. Once you dig into IPC, front-end
stalls, and back-end stalls, you start to see the turtles.

------
prestonbriggs
At Tera, we were able to issue 1 instruction/cycle/CPU. The hardware could
measure the number of missed opportunities (we called them phantoms) over a
period of time, so we could report percent utilization accurately. Indeed, we
could graph it over time and map periods of high/low utilization back to
points in the code (typically parallel/serial loops), with notes about what
the compiler thought was going on. It was a pretty useful arrangement.

------
exabrial
Your CPU will execute a program just as fast at 5% utilization as at 75%.

We honestly need a tool that compares I/O, memory fetch, cache-miss, TLB
misses, page-outs, CPU Usage, interrupts, context-swaps, etc all in one place.

------
deathanatos
There's also loadavg. I've encountered a lot of people who think that a high
loadavg MUST imply a lot of CPU use. Not on Linux, at least:

> _The first three fields in this file are load average figures giving the
> number of jobs in the run queue (state R) or waiting for disk I/O (state D)
> averaged over 1, 5, and 15 minutes._

Nobody knows about the "or waiting for disk I/O (state D)" bit. So a bunch of
processes doing disk I/O can cause loadavg spikes, but there can still be
plenty of spare CPU.
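
Reading those three fields is trivial; the subtlety is entirely in what they count. A minimal parser over a sample line (the sample values are made up):

```python
def parse_loadavg(text):
    """Parse the first three fields of a /proc/loadavg line. On Linux
    these average tasks that are runnable (state R) *and* tasks in
    uninterruptible disk wait (state D), so a high value need not
    mean high CPU use."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)

# Sample line: plenty of "load", possibly all of it waiting on disk.
print(parse_loadavg("8.42 6.10 3.77 2/517 12345"))
```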

------
heinrichhartman
It seems to me that the CPU utilization metric (from /proc/stat) has far more
problems than misreporting memory stalls.

As far as I understand it, the metric works as follows: at every clock
interrupt (every 4ms on my machine) the system checks which process is
currently running, before invoking the scheduler:

\- If it is the idle process, idle time is accounted.

\- Otherwise the processor is regarded as utilized.

(This is what I got from reading the docs and digging into the source code. I
am not 100% confident I understand this completely at this point. If you know
better please tell me!)

There are many problems with this approach: every time slice (4ms) is
accounted either as completely utilized or completely free. There are many
reasons for processes going on-CPU or off-CPU outside of clock interrupts;
blocking syscalls are the most obvious one. In the end a time slice might be
utilized by multiple different processes and interrupt handlers, but if at the
very end of the time slice the idle thread is scheduled on-CPU, the whole
slice is counted as idle time!

See also:
[https://github.com/torvalds/linux/blob/master/Documentation/cpu-load.txt](https://github.com/torvalds/linux/blob/master/Documentation/cpu-load.txt)
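
The userspace side of this accounting is just deltas over the `cpu` line in /proc/stat. A sketch of how a tool like top turns two samples into a busy percentage (field order is user nice system idle iowait irq softirq steal; the sample values are invented):

```python
def cpu_busy_percent(sample_a, sample_b):
    """Utilization between two /proc/stat 'cpu' lines: non-idle
    jiffies over total jiffies. Idle is taken as idle + iowait,
    as top does."""
    a = [int(f) for f in sample_a.split()[1:]]
    b = [int(f) for f in sample_b.split()[1:]]
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    idle = deltas[3] + deltas[4]  # idle + iowait fields
    return 100.0 * (total - idle) / total

t0 = "cpu 100 0 50 800 50 0 0 0"
t1 = "cpu 160 0 70 900 70 0 0 0"
print(cpu_busy_percent(t0, t1))  # -> 40.0
```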

------
alain94040
The article is interesting, but IPC is the wrong metric to focus on. Frankly,
the only thing we should care about when it comes to performance is time to
finish a task. It doesn't matter if it takes more instructions to compute
something, as long as it's done faster.

The other metric you can mix with execution time is energy efficiency. That's
about it. IPC is not a very good proxy. Fun to look at, but likely to be
highly misleading.

~~~
GuB-42
The idea here is that it appears that you are limited by the CPU and want to
make things faster.

When doing optimized code, the question "should I optimize for memory or for
computing?" comes up often. Should I cache results? Should I use a more
complex data structure in order to save memory or improve locality?

IPC is a good indicator of how you should tackle the problem. High IPC means
you may be doing too many calculations, while low IPC means that you should
look at your memory usage. BTW, most of the time, memory is the problem.

------
westurner
Instructions per cycle:
[https://en.wikipedia.org/wiki/Instructions_per_cycle](https://en.wikipedia.org/wiki/Instructions_per_cycle)

What does IPC tell me about where my code could/should be async so that it's
not stalled waiting for IO? Is combined IO rate a useful metric for this?

There's an interesting "Cost per GFLOPs" table here:
[https://en.wikipedia.org/wiki/FLOPS](https://en.wikipedia.org/wiki/FLOPS)

Btw these are great, thanks:
[http://www.brendangregg.com/linuxperf.html](http://www.brendangregg.com/linuxperf.html)

( I still couldn't fill this out if I tried:
[http://www.brendangregg.com/blog/2014-08-23/linux-perf-tools-linuxcon-na-2014.html](http://www.brendangregg.com/blog/2014-08-23/linux-perf-tools-linuxcon-na-2014.html) )

~~~
gefh
It's not that your code should be async, but that it should be more cache-
friendly. It's probably mostly stalled waiting for RAM.

~~~
westurner
Oh, is this because of context switching for resource sharing?

~~~
adrianN
[http://stackoverflow.com/questions/16699247/what-is-cache-friendly-code](http://stackoverflow.com/questions/16699247/what-is-cache-friendly-code)

------
surki
Another related tool I found interesting: perf c2c

This will let us find the false sharing cost (cache contention etc).

[https://joemario.github.io/blog/2016/09/01/c2c-blog/](https://joemario.github.io/blog/2016/09/01/c2c-blog/)

------
jarpineh
By clicking through some links on the article I found this:
[http://www.brendangregg.com/blog/2014-10-31/cpi-flame-graphs.html](http://www.brendangregg.com/blog/2014-10-31/cpi-flame-graphs.html)

Now I wonder how much manual work it would be to do these combined
flamegraphs with CPI/IPC information? My cursory search found nary a mention
after 2015... Perhaps this is still hard and complicated.

To me it seems really useful to know _why_ a function takes so long (waiting
or calculating) and not "merely" how long it takes. Even if the information is
not perfectly reliable, nor measurable without affecting execution.

------
jeevand
Interestingly, IPC is also used to verify new chipsets in embedded companies.
Run the same code on a newer-generation chipset and see if the IPC is better
than the previous one. IPC is one of the main criteria for whether the new
chipset is a hit or a miss (others are power, etc.).

~~~
brendangregg
Interesting. That must get messed up whenever they change the architecture in
a major way, eg, Intel Skylake going from 4-wide to 5-wide.

~~~
jeevand
AFAIK this is used in embedded chipsets that come with preloaded software like
modems.

------
glandium
I didn't know about tiptop, and it sounds interesting. Running it, though, it
only shows "?" in the Ncycle, Minstr, IPC, %MISS, %BMIS and %BUS columns for a
lot of processes, including, but not limited to, Firefox.

~~~
brendangregg
On the cloud? On AWS EC2? The PMCs required for tiptop to work were only
enabled last week, for dedicated full-host systems:
[http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html](http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html)

~~~
glandium
On my laptop. And I do get values for some processes. But not all of them.

------
joosters
I can't see a mention of it here, or on the original page, so IMO it's worth
pointing out a utility that you will most likely already have installed on
your Linux machine: _vmstat_. Just run:

    vmstat 3

And you'll get a running breakdown of CPU usage (split into user/system) and a
breakdown of 'idle' time (split into actual idle time and time waiting for I/O
or some kinds of locks).

The '3' in the command line is just how long the stats are averaged over; I'd
recommend using 3+ to average out bursts of activity on a fairly steady-state
system.

------
toast0
CPU util might be misleading, but CPU idle under a threshold at peak [1] means
you need more idle CPU, and you can get that by getting more machines, getting
better machines, or getting better code.

Only when I'm trying to get better code do I need to care about IPC and cache
stalls. I may also want better code to improve the overall speed of execution,
too.

[1] (~50% if you have a pair of redundant machines and load scales nicely,
maybe 20% idle or even less if you have a large number of redundant machines
and the load balances easily over them)

------
gpvos
The server seems overloaded (somewhat ironically). Try
[http://archive.is/stDR0](http://archive.is/stDR0) .

------
xroche
> You can figure out what %CPU really means by using additional metrics,
> including instructions per cycle (IPC)

Correct me if I am wrong, but this won't work for spinlocks in busy loops: you
do have a lot of instructions being executed, but the whole point of the loop
is to wait for the cache to synchronize, and as such, this should be taken as
"stalled".

~~~
IIIIIIIIIIII
Which is mentioned in the article (which few people fully read of course)!

See under "Other reasons CPU Utilization is misleading"

> _Spin locks: the CPU is utilized, and has high IPC, but the app is not
> making logical forward progress._

------
deegu
CPU frequency scaling can also lead to somewhat unintuitive results. On a few
occasions I've seen CPU load % increase significantly after code was
optimized. The optimization was still valid, and the actual instructions
executed per work item went down, but the CPU load % went up since the OS
decided to clock down the CPU due to the reduced workload.
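
A toy model of the effect: if the work per interval is fixed in cycles, halving the clock doubles the busy time and therefore the reported load. The numbers are illustrative:

```python
def load_percent(work_cycles, freq_hz, interval_s):
    """Reported CPU load over an interval when a fixed amount of work
    (in cycles) runs at a given clock speed. Illustrative model only."""
    busy_s = work_cycles / freq_hz
    return 100.0 * busy_s / interval_s

# Same work per second; the governor halves the clock after the
# optimization reduced demand, so the reported load goes *up*.
print(load_percent(3.0e8, 3.0e9, 1.0))  # full clock: ~10% load
print(load_percent(3.0e8, 1.5e9, 1.0))  # half clock: ~20% load
```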

------
jwatte
I think thinking about the CPU as mainly the ALU is myopic. The job of the
CPU is to get data into the right pipeline at the right time. Waiting for a
cache miss means it's busy doing its job. Thus, CPU busy is a reasonable
metric the way it is currently defined and measured. (After all, the memory
controller is part of the CPU these days.)

~~~
IIIIIIIIIIII
It also is reasonable to know about cache misses so that you can do something
about them _if you decide that's possible_. Do you think that because the
information is not always valuable, maybe even rarely (when you average over
all programmers worldwide, most of whom do web stuff), it never is, for
anyone, ever?

------
kazinator
This article is not as silly as it could be.

Let me help.

Look, CPU utilization is misleading. Did you forget to use -O2 when compiling
your code? Oops, CPU utilization is now including all sorts of wasteful
instructions that don't make forward progress, including pointless moves of
dead data into registers.

Are you using Python or Perl? CPU utilization is misleading; it's counting all
that time spent on bookkeeping code in the interpreter, not actually
performing your logic.

CPU utilization also measures all that waste when nothing is happening, when
arguments are being prepared for a library function. Your program has already
stalled, but the library function hasn't started executing yet for the silly
reason that the arguments aren't ready because the CPU is fumbling around with
them.

Boy, what a useless measure.

~~~
wzdd
> Boy, what a useless measure.

The article makes the perfectly-reasonable observation that "CPU usage" also
includes the time spent waiting on memory, that this time (as a %) has
increased significantly in modern processors, and that visibility into stall
time is potentially actionable to increase efficiency. It's useful to think
about the continued relevance of performance metrics which were introduced in
the 1970s and have continued basically unchanged into 2017. The article does
not claim that "CPU usage", as a metric, is useless. Your comment is
satirising a strawman.

~~~
kazinator
It isn't called useless, just "misleading" and "wrong". How useful is
something if it is both?

If I have two programs which scan through a 100 Gb file to do the same
processing, and both take the same amount of time (the task is I/O bound),
which is more efficient?

Let's see: one uses 70% CPU, the other 5%: clear.

How is that wrong; where are we misled?

This is simply telling us that although the elapsed wall time is the same, one
program is using more processor time. We could look at the processor time
instead of the percentage: that's just a rearrangement of the same figures.
That 70% is just the 0.7 factor between the processor time and real time.

It is the article that is fighting a strawman because nobody uses raw CPU time
or utilization to measure how well a program execution is using the internal
resources of the processor.

I have never seen anyone claim, based only on processor time alone and nothing
else, that a program which takes longer to do the same thing as another one is
due to a specific cause, like caches being cold, or a poorer algorithm. That
person would be misled and wrong, not CPU time _per se_.

~~~
scott_s
brendangregg is clearly trying to educate people who may not be aware of the
difference between "CPU is busy" and "CPU is doing useful work." You already
know this, and that's okay! You're not the target audience. I concede his
title was slightly on the side of "I'm going to say something simple and
catchy, then walk it back a bit with a nuanced explanation", but I'm willing
to give him slack considering his numerous
([http://www.brendangregg.com/portfolio.html](http://www.brendangregg.com/portfolio.html))
contributions to the literature for analyzing systems performance.

------
gens
The core waiting for data to be loaded from RAM _is_ busy. Busy waiting for
data.

Instructions per cycle can also be misleading. Modern CPUs can do multiple
shifts per cycle, but something like division takes a long time.

It all doesn't matter anyway, as instructions per cycle does _not_ tell you
anything specific. Use the CPU's built-in performance counters; use perf. It
basically works by sampling every once in a while. It (perf, or any other tool
that uses performance counters) shows you exactly which instructions are
taking up your process's time. (Hint: it's usually the ones that read data
from memory, so be nice to your caches.)

It's not rocket surgery.

~~~
brendangregg
> _The core waiting for data to be loaded from RAM is busy. Busy waiting for
> data._

A) it can run another hyperthread with those cycles, and,

B) even without hyperthreads, it misleads the developers as to where to tune.

C) also, "busy waiting" sounds like an oxymoron. DRAM is busy. The CPU is
waiting. Yes, the OS doesn't differentiate and calls it %CPU.

> _It all doesn't matter anyway, as instructions per cycle does not tell you
> anything specific._

It provides a clue as to whether %CPU is instruction bound or stall bound.

> _Use the cpu-builtin performance counters, use perf. It basically works by
> sampling every once in a while._

That's called sampling mode. The example in my post used counting mode.

Edit: added (C).

~~~
fulafel
I think A is relatively uncommon, as long as mainstream CPUs only have 2
threads per core. It may give you a boost when you happen to have between N
and 2N runnable threads, for N = number of cores. And those idle CPU threads
were already showing up as idle logical processors in top, so you would
already have had a good reason to try to employ them.

(terminology nitpick: HyperThreading is just Intel's proprietary trademark for
SMT)

------
taeric
This is silly. The conceit that IPC simplifies to "higher is better" has
exactly the problem he has with utilization.

True, but useful? Most of us are busy trying to get writes across a networked
service. Indeed, getting to 50% utilization is often a dangerous place.

For reference, running your car by focusing on rpm of the engine is silly.
But, it is a very good proxy and even more silly to try and avoid it. Only if
you are seriously instrumented is this a valid path. And getting that
instrumented is not cheap or free.

~~~
IIIIIIIIIIII
This is a gross misrepresentation of the article. It is in _your own head_
that he solely focuses on this one thing. When somebody says "this soup needs
more salt" you start talking about how there always is too much salt in
processed food these days? The article is a about a concrete narrow subject,
not about "performance" (the entire field).

Can't some people read an article without extrapolating to the end of the
known universe and just stick to _just_ the article's actual narrow subject?

~~~
taeric
Take my criticism as being on the chosen title and framing.

It is a neat subject. Just as knowing different ways of measuring car
utilization and power transfer. Framing it as a take down of how people
commonly do it is extreme and at major risk of throwing out the baby with the
bath water.

------
alkonaut
Is there any easy way to do profiling that reveals stalled CPU because of
pointer chasing, for "high level devs" on Windows?

------
heisenbit
Any way to do something equivalent on OSX?

------
buster
This was very enlightening. I have the highest respect for Brendan and his
insights, I must say.

------
JohnLeTigre
Or your code could be riddled with thread contention.

I guess this is why he used the term "likely".

Interesting article, though.

------
willvarfar
Using IPC as a proxy for utilization is tricky because an out-of-order machine
can only reach its max IPC if the instructions it is executing are not
dependent on not-yet-computed results.
In-order CPUs are much easier to reason about; you can literally count the
stalled cycles.

------
spullara
Need a new metric "CPU efficiency".

------
nhumrich
Totally disagree with the premise of the article. Every metric tool that I
know of that shows CPU utilization correctly shows CPU work. Load, on the
other hand, represents CPU and iowait (overall system throughput). IO wait is
also exposed in top as the "wait" metric. An Amazon EC2 box can very easily
get to load(5) = 10 (anything above 1 is considered bad), but the CPU
utilization metric will still show almost no CPU util.

~~~
brendangregg
Sorry, I'm not talking about iowait at all. I think you're thinking about load
averages, which I'm not talking about in that article. I'm actually talking
about memory stalls.

------
flamedoge
> If your IPC is < 1.0, you are likely memory stalled,

depends on the workload.

~~~
IIIIIIIIIIII
It's always nice to see when people take even less than a complete sentence
and pretend that this is what the author said and that this was all.

~~~
qb45
Better yet if the response is even shorter than the excerpt quoted ;)

------
tjoff
Well, this is the reason I hate HyperThreading: does your app consume 50% or
100%? With hyperthreading you have no clue.

And that is per core, it becomes increasingly meaningless on a dualcore and on
a quadcore and above you might as well replace it with MS Clippy.

And this is before discussing what that percentage really means.

edit: I'm interpreting the downvotes that people are in denial about this ;)

~~~
sqeaky
Different task monitors make this clear in their docs and they tend to stick
with it. Read up on the tools you use once and this source of mystery will be
gone forever.

As far as hyperthreading goes, not understanding something is a bad reason to
dislike it. For most developer purposes, pretend it is another CPU until you
learn enough about profiling multithreaded apps that you need to care about
the difference.

~~~
tjoff
My comment was from a user perspective. I don't mind the performance increase
from HT (and that is why I still have it enabled).

But, it really bothers me that it gets much harder to assess both the total
load and the load produced by a single process. I haven't found any useful
info on how to deal with this in simple overview user-level tools such as top
or the windows task manager.

 _What are the disadvantages?

CPU time accounting measurements (sys/usr/idle) as reported by standard tools
do not reflect the side-effects of resource sharing between hardware threads

It is impossible to correctly measure idle and extrapolate available computing
resources_

[https://blogs.oracle.com/partnertech/cpu-utilization-of-multi-threaded-architectures-explained](https://blogs.oracle.com/partnertech/cpu-utilization-of-multi-threaded-architectures-explained)

