
Operation Costs in CPU Clock Cycles - MattHarrington
http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/
======
eloff
A syscall is a lot cheaper than shown here in terms of direct costs (~150
cycles), and also a lot more expensive, depending on the call, when you factor
in the full cost of the cache that gets clobbered - more like the full cost of
a context switch shown here, around 10-30K cycles [1]. For this reason it's
important to use system calls that allow you to amortize more work into one
call, like preadv, pwritev, recvmmsg, and sendmmsg.

[1] http://www.cs.cmu.edu/~chensm/Big_Data_reading_group/papers/flexsc-osdi10.pdf
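
As a minimal sketch of that amortization idea (assuming Linux/POSIX; the file
name, buffer sizes and offset are arbitrary illustration values):

```cpp
// Sketch: batching two reads into one kernel crossing with preadv.
#include <fcntl.h>
#include <sys/uio.h>
#include <cstdio>

int main() {
    // Hypothetical input; any readable file works for the illustration.
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    char a[16], b[16];
    iovec iov[2] = {
        { a, sizeof a },   // first chunk of the file lands here
        { b, sizeof b },   // second chunk lands here
    };

    // One syscall fills both buffers, instead of two separate read() calls.
    ssize_t n = preadv(fd, iov, 2, /*offset=*/0);
    std::printf("read %zd bytes in a single syscall\n", n);
    return 0;
}
```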

~~~
bogomipz
>"and also a lot more expensive, depending on the call, when you factor in the
full cost of the cache that gets clobbered."

But a syscall doesn't necessarily mean a context switch, right? It's just a
mode switch if the kernel is servicing the syscall on behalf of the same
process. The cache wouldn't get clobbered then because the kernel is mapped
into the top of every process, for just this reason.

Or am I misunderstanding what you are saying?

~~~
fdsaaf
You are correct. The author is considering the cost of accessing kernel-mode
code and data, which I don't think is fair.

~~~
eloff
It absolutely is fair. For a long time after a syscall your program will run
slower thanks to the increased cache misses from the kernel clobbering your
stuff in cache. That's a real price that you pay for a syscall and it would be
wrong not to count it. However, if you found some other way of doing the work,
e.g. user code, kernel bypass, or an amortized bulk syscall, those would also
have a (lesser) effect on the cache. So to be fair you compare a syscall
against that, not against zero.

~~~
JoshTriplett
If you have enough CPU cache, the CPU will cache both kernel and user
code/data for your process. If you run long enough and access enough user data
that the kernel bits get evicted, then you'll take more of a performance hit,
but the same thing applies to accessing enough user data that different parts
of your user data get evicted.

~~~
eloff
Maybe this is more reflective of the kinds of software I work on, but I don't
find that I have enough cache. So syscalls always come at a steep price. In
fact I am tempted to break one program up into multiple processes and use the
cache partitioning in Xeon v4 to prevent the different parts of my program
from clobbering each other's cache.

------
digi_owl
Anyone have the presentation from an Intel guy on how the CPU design focus has
moved from cycles to cache misses handy?

Edit: never mind, it was not an Intel guy. And I actually had the thing
bookmarked (and it still worked).

https://www.infoq.com/presentations/click-crash-course-modern-hardware

~~~
vcarl
There's a fantastic, massively upvoted StackOverflow post that can also
provide some insight here. This may be a little more accessible, since it's
such a significant runtime difference with very simple source code.

http://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array
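
Roughly, the experiment in that question boils down to this minimal C++ sketch
(sizes are arbitrary, and note that an aggressive optimizer may turn the branch
into branchless code and hide the effect):

```cpp
// Sketch: summing values >= 128 over the same data, unsorted vs. sorted.
// The sorted pass is typically much faster because the branch inside the
// loop becomes predictable.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

static long long sum_big(const std::vector<int>& v) {
    long long sum = 0;
    for (int x : v)
        if (x >= 128) sum += x;   // this branch is what the predictor sees
    return sum;
}

int main() {
    std::vector<int> data(1 << 24);
    std::mt19937 rng(42);
    for (int& x : data) x = rng() % 256;

    auto time_it = [&](const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        long long s = sum_big(data);
        auto t1 = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%s: sum=%lld, %lld ms\n", label, s, ms);
    };

    time_it("unsorted");
    std::sort(data.begin(), data.end());
    time_it("sorted");
    return 0;
}
```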

~~~
globuous
Thanks so much for sharing this, great read!

For those, like me, who want to play with what this StackOverflow post talks
about, here's a fiddle of it: https://jsfiddle.net/tbinetruy/Latkmk2q/1/
(the code takes ~2s to run and loads Firebug for console logs).

------
quotemstr
Honestly, it's a lot more complicated than that, and the article itself
presents a pretty complicated (and flawed) model. There's no substituting for
benchmarks.

Really, if you want a quick and dirty heuristic, the smallest program is
likely the fastest. Everything else will bite you in the ass.

Anti-exception stuff: the author repeats the old, harmful "exceptions are for
exceptional cases" bromide while overselling the cost of exception dispatch
(5k cycles is cheap in the scheme of things) and underestimating (or, rather,
not mentioning) the icache cost of error checking.

Now that I'm looking, the author glosses over many other things as well, like
modern system call interfaces (no exceptions!), cache pressure generally, and
branch prediction efficacy. These factors are strong arguments _against_
generating reams of template code using CRTP.

Like most analyses that purport to show the "true" cost of anything on modern
superscalar virtual memory systems, it oversimplifies.

~~~
no-bugs
> Honestly, it's a lot more complicated than that, and the article itself
> presents a pretty complicated (and flawed) model. There's no substituting
> for benchmarks.

Of course, all models are inherently flawed, and in theory there is no
substitute for benchmarks, but going along the lines of "all models out
there are flawed, so we don't need ANY model and should do benchmarking
instead" leads to software which is badly architected to start with (and
then no benchmarking will be able to heal it without throwing the whole thing
away and rewriting it from scratch - which is not going to happen for most
projects, for many practical reasons).

In other words, if we're already in the vicinity of the optimum (which BTW may
happen to be a local one - but this is a very different story), then
benchmarking is the way to go (that is, if those 20% we're chasing are worth
the trouble - and TBH, they really are). However, at the earlier stages, being
able to realise the order of magnitude involved, without spending time trying
really badly architected things, is extremely important.

Just one example - no optimisation will save a (non-GPGPU) program which tries
to split the load over multiple threads in chunks of 100 cycles or so per
thread (before each thread relinquishes control to another one); I don't need
to benchmark it to know that in at least 99.9999% of real-world projects it is
suicide from a performance point of view - and this observation follows
directly from the diagram.
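
As a rough illustration of why (a sketch only: std::async here spawns a thread
per chunk, which overstates a bare context switch, but the order-of-magnitude
mismatch between handoff cost and ~100 cycles of work is the point; the chunk
count and the "work" function are arbitrary):

```cpp
// Sketch: doing a ~100-cycle chunk of work per thread handoff vs. inline.
#include <chrono>
#include <cstdio>
#include <future>

static long long tiny_chunk(int x) { return 1LL * x * 2654435761LL % 1000003; }

int main() {
    const int chunks = 10000;
    long long acc = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < chunks; ++i)                      // one handoff per tiny chunk
        acc += std::async(std::launch::async, tiny_chunk, i).get();
    auto t1 = std::chrono::steady_clock::now();

    for (int i = 0; i < chunks; ++i)                      // same work, done inline
        acc -= tiny_chunk(i);
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::printf("offloaded: %lld ms, inline: %lld ms (check=%lld)\n",
        (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
        (long long)std::chrono::duration_cast<ms>(t2 - t1).count(), acc);
    return 0;
}
```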

Or another example - in most practical cases (actually, pretty much always,
except for inserts into the middle without iterating through, and specialized
stuff such as merges), vectors have an advantage over lists - and this
advantage can easily reach over 100x(!) (see, for example,
https://accu.org/index.php/journals/2268 ). This, once again, follows from the
diagram (it is a direct consequence of a main memory read being over 100
cycles and an L1 read being 3-4 cycles).
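
For readers who want to see the mechanism behind that gap rather than take the
100x on faith, here is a minimal sketch (the element count is arbitrary, and a
freshly built list is less fragmented than a real-world one, which understates
the difference):

```cpp
// Sketch: traversing the same values as a contiguous vector vs. a linked list.
// The list walk chases a pointer per node (potential cache miss per element);
// the vector walk streams through contiguous memory.
#include <chrono>
#include <cstdio>
#include <list>
#include <numeric>
#include <vector>

template <typename Container>
static void timed_sum(const Container& c, const char* label) {
    auto t0 = std::chrono::steady_clock::now();
    long long sum = std::accumulate(c.begin(), c.end(), 0LL);
    auto t1 = std::chrono::steady_clock::now();
    long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("%-6s sum=%lld  %lld us\n", label, sum, us);
}

int main() {
    const int n = 10000000;               // big enough to spill out of the caches
    std::vector<int> v(n);
    std::iota(v.begin(), v.end(), 0);
    std::list<int> l(v.begin(), v.end()); // same values, one heap node per element

    timed_sum(v, "vector");
    timed_sum(l, "list");
    return 0;
}
```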

This list of crazily inefficient things (which are still widely used in the
industry for no reason) can go on and on. Of course, going into any more
detailed analysis than "order of magnitude" indeed doesn't make sense - and
this is clearly stated in the OP. But for initial estimates when making
architectural decisions (and to avoid premature pessimisation), such "order
of magnitude" analysis can be very useful.

------
trsohmers
While this is a pretty decent overview, it is not very precise, and is even a
bit misleading in certain cases. Depending on where certain instructions are
in the execution pipeline, you can be limited in what can be issued and
executed, and the latencies are not deterministic. This is one of the biggest
difficulties in optimizing scheduling in a complex architecture such as x86,
and it gets even worse (determinism-wise) when you throw in cache misses and
branch prediction.

For a more exhaustive instruction latency listing over a variety of
microarchitectures, check out [0]. [1] is also a great resource for memory
latencies across a variety of processors.

[0]: http://www.agner.org/optimize/instruction_tables.pdf

[1]: http://www.7-cpu.com

------
logicallee
For people who make interactive applications, here is an interesting thing to
add:

Making a user interpret an error message or notification of some kind (8
seconds) and dismiss a window (3 seconds), rather than letting them continue
seamlessly, costs:

44,000,000,000 operations (11 seconds at ~4 GHz), or 275,000,000,000
operations if we include the idling GPU.

Distance light travels in this time: around the world 77 times, all while we
keep the user from being able to do anything.

Something to think about the next time you throw up a dialog box...

~~~
taneq
If you want to put it that way, you can think of information in the user's
brain as being stored in peripheral memory at the end of a horribly slow,
high-latency, unreliable bus. :)

~~~
logicallee
This is really good phrasing!!! I love it.

Especially when you need the user to take an action: the API isn't well-formed
(it's, for example, a mix of visual cues, text, and UX behavior), but at the
end of the day, if you give the user something to do (for example, you want
them to acknowledge and dismiss a notification, or make a choice), there are
multiple ways of doing it, some much faster and others much slower.

This is especially interesting when you need the user to deal with your bugs
(e.g. errors of various kinds). It's not really part of what _they're_ trying
to do - you're the one who needs them to react somehow.

very interesting way of thinking about it!

------
glangdale
In places, this is quite simple-minded, though there is a nod to the
distinction between throughput and latency. A function call may effectively
cost nothing; if the function and its caller do useful work, a few store/load
pairs (saving and restoring registers) and a correctly predicted return
address may be a handful of micro-operations merged in with a stream of other
instructions. The idea that there is 20-50 cycles of actual _cost_ from a
function call is outlandish.

Similarly, the virtual function call number looks way out of whack as long as
the indirect branch target is correctly predicted (which, if the use of
virtual functions is substituting for a more straightforward approach, is
likely - and if it's quite unpredictable where the indirect branch in your
virtual function call would go, any other approach probably just moves the
branch mispredict somewhere else).

The numbers are not overall "crazy-wrong", but there is a tone of precision
and certainty to them that is a bit deceptive. Talking about how much an
operation costs in isolation is pointless; build things one way, measure that,
then make the smallest change possible and measure that. Most of these costs
make sense only in context (if you're on a critical path, measure latency; if
you're on an oversubscribed execution resource, measure throughput; and so on).

~~~
no-bugs
> may be a handful of microoperations merged in with a stream of other
> instructions.

It may indeed, in theory, but most of the time in the real world it won't. And
BTW - let's not forget about the implicit costs of being unable to inline
(which can be huuuuuge). Speaking of the real world - we've just had our
article accepted, presenting supposedly the fastest universal hashing function
as of today (in spite of the math being heavier than in existing ones) - and
the numbers in the OP are consistent with our real-world experience while we
were optimising it (well, within an order of magnitude at least).

> but there is a tone of precision and certainty to them that is a bit
> deceptive.

OP: "Last but not least, a word of caution: all the estimates here are just
indications of the order of magnitude".

> Talking about how much an operation costs is pointless; build things one
> way, measure that, then make the smallest change possible and measure that.

Sure. The problem is that almost nobody has time to do that in real-world
projects. Which leads to even cruder estimates (such as "virtual function
call costs are negligible, regardless of the number of times they're called")
being used - causing lots of crazily inefficient programs. IMO, the OP is a
reasonable middle ground between all-out quasi-stationary testing and even
worse guesstimates ;-).

~~~
glangdale
I have been working in this area for 10 years, made millions of dollars of
revenue from a product that was largely dependent on getting this sort of
thing right, sold a startup to Intel whose main business was in this area, and
work for Intel. I'm not 100% sure you are well situated to lecture me about
the 'real world' because you wrote an academic paper.

Congratulations on getting a paper accepted (I would be interested in reading
it, as I love hashing work), but your claim that "most of the time in
real-world it won't" is nonsense. The typical x86 function call overhead is a
correctly predicted call, some register saves (using an optimized stack
engine), the real work, and some register restores (using the same optimized
stack engine). This is not typically 15-30 cycles worth of overhead. The loads
and stores generally go to and from L1 cache, are generally predictable (if
the function itself is predictable), and none of the operations that are part
of a conventional function call are all that expensive. In 15-30 cycles you
can do 30-60 loads and/or 15-30 stores (mileage varies by uarch) and 60-120
integer operations if you are very, very lucky. Compare this with the typical
argument setup, function prolog/epilog, etc.
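
To make "not all that expensive" concrete, here is a tiny leaf function and,
in the comment, the sort of code a typical x86-64 compiler emits for it at -O2
(illustrative only - exact output varies by compiler and ABI, and the noinline
attribute is GCC/Clang-specific):

```cpp
// Sketch: the mandatory cost of a small, non-inlined call on x86-64.
//
//   add_scaled:                ; callee: a leaf, so no prologue/epilogue at all
//       lea  eax, [rdi+rdi*2]  ; a * 3
//       add  eax, esi          ; + b
//       ret                    ; correctly predicted return address
//
//   caller side: two argument moves into edi/esi, a (predicted) call, done.
//
// The unavoidable overhead is a handful of cheap, pipelined instructions;
// the expensive part is usually the lost inlining and cross-function
// optimization, not the call sequence itself.
__attribute__((noinline)) int add_scaled(int a, int b) { return a * 3 + b; }

int main() { return add_scaled(2, 3) == 9 ? 0 : 1; }
```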

As you hint at, function call overhead generally comes from interrupting live
ranges (forcing save/restore or simply causing the register allocator to pick
a worse strategy) and losing the opportunity to optimize across function
boundaries - this cost can be _enormous_ and nebulous, and it isn't even a
constant (it imposes costs repeatedly on the program). I have code where the
cost of sticking a function call in and losing some constant-propagation
information is millions of cycles per call, not 15-30. In other places the
cost of a call is effectively zero.

In still other places the cost of a function call is 'negative' - that is,
it's cheaper to have a function exist as an independent unit than to inline
it. This is typically an i-cache issue, but we've seen a host of weird effects
here.

So - under some (unusual) circumstances, function call overhead can be
practically free (i.e. the OoO engine has a lot of opportunities to insert
prolog/epilog instructions, there are no branch mispredicts, etc.). Typically
it will be inherently cheaper than 15-30 cycles - but lost opportunities for
optimization may take you to numbers that are insanely higher than that.

The reason I arced up over this is that it's just not useful to say "within an
order of magnitude". If you say 15-30 is the magic number, you are creating
folklore. We are burdened by folkloric stuff ("the interpreter penalty that
results because branch predictors 'can't predict indirect branches'") that
results in considerably worse designs. It's better to know you don't know
than to promulgate simplistic rules of thumb that misstate the real issues.

This is particularly true because a lot of this 'folkloric optimization' stuff
people do generally leads away from a simple and direct expression of their
ideas in their favorite coding style, and towards a heavily "optimized" form
that's super-obscure because some "performance high priest" has declared it
Performant. We've had a number of cheap laughs in our day re-rolling loops,
replacing inline asm with C code, and restoring sanity and getting a
performance _win_ from doing it. I've been just as guilty of playing
Performance High Priest as anyone else, mind you.

~~~
no-bugs
> made millions of dollars of revenue... I'm not 100% sure you are well
> situated to lecture me about the 'real world' because you wrote an academic
> paper.

So, we're going to discuss millions of dollars instead, sigh... By that logic,
Trump should be one of the best programmers in the universe, I guess.

> Typically it will be inherently cheaper than 15-30 cycles - but lost
> opportunities for optimization may take you to numbers that are insanely
> higher than that.

And still, even from your own rant it follows that in the vast majority of
cases the estimate of 15-30 cycles will be well within an "order of magnitude"
(TBH, I haven't seen millions myself, but that must be a really strange corner
case).

> If you say 15-30 is the magic number you are creating folklore... We are
> burdened by folkloric stuff... that result in considerably worse designs.

And not having any such "folklore" results in even worse designs :-( (actually
- MUCH worse ones). Using a list instead of a vector can easily give you a
100x penalty for absolutely zero reason (actually, up to 780x was
experimentally observed - and that's without swapping). Off-loading 100-cycle
chunks (with a thread context switch back after calculating those 100 cycles)
to a different thread will never work, at least on x64 (though I remember
meeting some folks from Intel - I think they were representing an OpenMP team
- who were seriously preaching otherwise, based on the utterly silly "how many
cores we managed to utilise" metric, without realising that the whole thing
became _slower_ after they parallelised it ;-( ). And so on and so forth.

Sure, the numbers are very rough. But trying to say "hey, it is not precise,
so let's not even try to estimate" - is even worse than that.

~~~
glangdale
I don't think Trump made his money doing low-level performance programming for
the past 10 years, so I'm not sure your analogy is valid.

However, since you, whoever you are, have not only written a hash table, but
discovered profundities like 'Sometimes list costs 780x as much as vector for
"absolutely zero reason"', and 'don't try to offload 100 cycles of work to
another thread' I'm going to defer to your expertise. I recommend you stick a
bone through your beard, pronounce yourself a performance guru, and make bank.
Have fun.

~~~
no-bugs
> I recommend you stick a bone through your beard, pronounce yourself a
> performance guru, and make bank. Have fun.

:-) :-) I LOVE it when my opponent has to resort to personal insults :-).
Leaving aside any sarcastic remarks in this regard:

For Intel's sake, I Really Hope that these "profundities" are indeed very well
known to you - and believe it or not, they've been very well known to me too
for at least 10 years. However, this is not the point; the point is that there
are LOTS of developers out there who do NOT know them - and the OP is intended
for them (and not for "performance gurus").

It is actually THIS simple. Eliminating 90% of inefficiency does not really
require black magic or "performance gurus" who know exactly how the pipeline
of a specific CPU works. And this is exactly what I'm arguing for - to educate
app-level developers and architects about this low-hanging fruit of 10x+
inefficiencies; I can assure you that it is very far from being universal
knowledge in app-level development circles.

------
CountSessine
This is a fantastic resource; kudos to the author. But there is one thing in
this reference that I found unexpected:

 _One further thing which is related to memory accesses and performance, is
rarely observed on desktops (as it requires multi-socket machines – not to be
confused with multi-core ones ... When multiple sockets are involved, modern
CPUs tend to implement so-called NUMA architecture, with each processor (where
“processor” = “that thing inserted into a socket”) having its own RAM_

I thought that all Intel chips since Nehalem divided their SDRAM access into a
NUMA configuration based on cores? Am I wrong about that?

~~~
jdub
I don't think there's much NUMA action on a single socket at the moment, but
as CPU area increases and more of the transistors are not actually doing CPU
work (they're there to spread out the heat-making bits), distances on a single
die increase - and this will change.

~~~
kikoreis
Unless there is core-specific RAM on the die, why? Isn't the essential aspect
of NUMA the fact that there is some memory which is "near", and some which is
"far"?

~~~
jdub
Yeah, as distance (latency) to RAM increases, the amount of on-die cache
increases (another handy way to distribute heat with a performance bonus) and
coherency becomes more costly, so in effect it becomes the core-specific RAM
you mention.

(Oh, hi Kiko!)

~~~
kikoreis
Hey Jeff... the nick fonts are really small on HN; I didn't see it was you!

I don't think latency to near RAM will increase; it would have too material a
performance impact. Even in disaggregated designs like Rackscale there is
definitely a concept of "near RAM", which is not cache, but which has very low
latency.

However, your post made me realize that as the number of cores goes up, as
with KNL, they are likely to be organized hierarchically, with some clustered
sharing of cache, so NUMA-style affinity of workload to core indeed starts
paying off there. IOW, if you have thousands of cores on a chip, they
definitely aren't all going to be sharing the same L2 and L3.

------
bogomipz
The post states:

"in particular, tcmalloc and ptmalloc2 allocators can take as little as
200-500 CPU cycles for allocation/deallocation of a small object"

Does anyone know how many cycles the regular glibc malloc() takes?

~~~
nullc
I believe ptmalloc2 is glibc's malloc.

~~~
bogomipz
Yes you are correct. Does anyone know how many cycles that is comparatively?
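
For anyone who wants a number for their own glibc, a rough way to measure it
yourself (the allocation size and iteration count are arbitrary, rdtsc counts
reference cycles on x86 with GCC/Clang, and a hot loop like this flatters the
allocator compared to real-world usage):

```cpp
// Rough sketch: estimating cycles per small malloc/free pair on this machine.
#include <cstdio>
#include <cstdlib>
#include <x86intrin.h>   // __rdtsc (x86, GCC/Clang)

int main() {
    const int iters = 1000000;
    volatile void* sink = nullptr;   // keep the optimizer from dropping the calls

    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < iters; ++i) {
        void* p = std::malloc(64);   // small-object allocation
        sink = p;
        std::free(p);
    }
    unsigned long long t1 = __rdtsc();

    std::printf("~%llu reference cycles per malloc+free\n", (t1 - t0) / iters);
    return 0;
}
```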

------
scottlamb
I like this chart. In particular, I like that they break apart "thread context
switch (direct costs)" and "thread context switch (total costs, including
cache invalidation)". I've often heard that context switches are cheap because
people only consider the direct costs.

~~~
quotemstr
Context switches are cheap because context switches are cheap. You can, with
enough finagling, make the "indirect costs" of anything arbitrarily high.

------
jheriko
i'm curious where this information comes from. a lot of it jars heavily
against personal experience and measurements i can do for myself.

the idea that floating point division is that much more expensive than
multiplication for instance... the only difference afaik is latency, not
timing.

the idea that an indirect call and a virtual function call are so close as
well... when the latter is a read followed by an indirect call - whilst the
timings given for some of the reads are considerably greater than either - is
utter nonsense on inspection.

take with a great pinch of salt and remember the one correct way to judge
timings is to measure them in context instead of guessing based on information
that could well be wrong.

imo this kind of article is harmful and misses the more important lesson:
measure, measure, measure.

~~~
fancy_pantser
There is a "references" section at the bottom:

[Agner4] Agner Fog, “Instruction tables. Lists of instruction latencies,
throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs”

[Agner3] Agner Fog, “The microarchitecture of Intel, AMD and VIA CPUs. An
optimization guide for assembly programmers and compiler makers”

[Intel.Skylake] “Intel® 64 and IA-32 Architectures Optimization Reference
Manual”, 2-6, Intel

[Levinthal] David Levinthal, “Performance Analysis Guide for Intel® CoreTM i7
Processor and Intel® XeonTM 5500 processors”, 22

[NoBugs] 'No Bugs' Hare, “C++ for Games: Performance. Allocations and Data
Locality”

[AlBahra] Samy Al Bahra, “Nonblocking Algorithms and Scalable Multicore
Programming”

[eruskin] http://assemblyrequired.crashworks.org/how-slow-are-virtual-functions-really/

[Agner1] Agner Fog, “Optimizing software in C++. An optimization guide for
Windows, Linux and Mac platforms”

[Efficient C++] Dov Bulka, David Mayhew, “Efficient C++: Performance
Programming Techniques”, p. 115

[Drepper] Ulrich Drepper, “Memory part 5: What programmers can do”, section
6.2.2

[TCMalloc] Sanjay Ghemawat, Paul Menage, “TCMalloc : Thread-Caching Malloc”

[Wikipedia.ProtectionRing] “Protection Ring”, Wikipedia

[Ongaro] Diego Ongaro, “The Cost of Exceptions of C++”

[LiEtAl] Chuanpeng Li, Chen Ding, Kai Shen, “Quantifying The Cost of Context
Switch”

------
fdsaaf
I am extremely skeptical of "full" rows in the table, the ones that purport to
measure the overall costs of cache invalidation. These costs are so workload
specific that a single number is meaningless and likely to mislead. My own
benchmarks show costs that are nowhere near the ones cited.

~~~
no-bugs
> My own benchmarks show costs that are nowhere near the ones cited.

This is heavily dependent on the benchmark. In the OP, there is a ref to
academic research showing these numbers - and from my own real-world
experience (NOT artificial benchmarks), costs of 10K-100K cycles are very
usual. From a completely different perspective - there should be reasons why
nginx beats Apache performance-wise :-).

------
OskarS
I'm surprised how fast a C function call is. I would have thought that
creating a stack frame would be slower than that (and significantly slower
than a floating point division), but I guess not.

~~~
umanwizard
Creating a stack frame is just updating a register and writing a couple of
values to L1 cache.

~~~
jheriko
sometimes not even that. it might just update a single register and write
nothing to cache or memory... and that cost might be hidden by out-of-order
execution and the parallelism of different units on the CPU, making it
effectively zero-cost.

------
raarts
Great overview! Would be even better as a general guideline if it included
costs for HD/SSD disk writes and network usage!

~~~
wmf
See the Jeff Dean numbers:
[https://gist.github.com/jboner/2841832](https://gist.github.com/jboner/2841832)

------
falcolas
The small difference between direct C calls and virtual C++ calls surprised
me, actually. I thought it would be much bigger.

~~~
acchow
A virtual call in C++ is just a vtable lookup followed by a direct C call.
Literally one pointer away. (Or perhaps one pointer plus one integer addition
away, for the offset?)
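
Conceptually it is the same as this hand-rolled version (a sketch of the
mechanism only - real ABIs differ in the details, and this is not what any
particular compiler literally emits):

```cpp
// Sketch of what a virtual call boils down to: load vptr, load slot, call.
#include <cstdio>

struct Shape;
struct VTable { double (*area)(const Shape*); };   // one slot per virtual function

struct Shape {
    const VTable* vptr;    // the hidden pointer the compiler normally adds
    double w, h;
};

static double rect_area(const Shape* s) { return s->w * s->h; }
static const VTable rect_vtable = { rect_area };

int main() {
    Shape r{ &rect_vtable, 3.0, 4.0 };

    // The "virtual call": one load for the vtable pointer, one for the slot,
    // then an indirect call - cheap when the branch predictor guesses the target.
    double a = r.vptr->area(&r);

    std::printf("area = %f\n", a);
    return 0;
}
```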

~~~
chrisseaton
Is it that simple for virtual methods in complex class hierarchies? If the
class hierarchy isn't linear, how can you assign simple absolute vtable
indices for methods?

~~~
dom0
Yes, it is. The C++ ABI defines how any permitted inheritance hierarchy is
mapped into a surjective, ordered set of vtables. Also, calls into methods of
other classes in the hierarchy offset the this pointer accordingly.

It feels counterintuitive that these two mechanisms are sufficient, but they
are.

