
How to achieve 4 flops per cycle? - glazskunrukitis
http://stackoverflow.com/questions/8389648/how-to-achieve-4-flops-per-cycle
======
rrrrtttt
Getting the theoretical peak performance requires either extensive manual
tuning on each architecture or exhaustive search. For the former, see
[http://www.nytimes.com/2005/11/28/technology/28super.html?pa...](http://www.nytimes.com/2005/11/28/technology/28super.html?pagewanted=all);
the latter is what ATLAS BLAS does, for example.

It is important to note that the SO discussion is focused on achieving peak
flops when the data is already in the XMM registers. The real bottleneck is
pushing the data quickly through the levels of the memory hierarchy.
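
To make the distinction concrete, here is a minimal sketch in C (not the code from the SO answer; the names `register_kernel` and `streaming_sum` are made up for illustration). The first function keeps all operands in local variables, which an optimizing compiler would normally hold in registers, and runs independent add and multiply chains. The second does one add per 8-byte load, so on a large array it is limited by how fast memory can deliver data rather than by the floating-point units.

```c
#include <stddef.h>

/* Hypothetical kernel: 8 flops per loop iteration (4 adds, 4 multiplies)
   and no loads or stores inside the loop. */
double register_kernel(size_t iterations)
{
    double add0 = 0.0, add1 = 0.0;   /* two independent add chains */
    double mul0 = 1.0, mul1 = 1.0;   /* two independent multiply chains */

    for (size_t i = 0; i < iterations; ++i) {
        add0 += 1.0;  add1 += 0.5;
        mul0 *= 2.0;  mul1 *= 0.5;
        add0 += 1.0;  add1 += 0.5;
        mul0 *= 0.5;  mul1 *= 2.0;   /* undo the scaling so values stay bounded */
    }
    /* Combine all chains so the results are actually used. */
    return add0 + add1 + mul0 + mul1;
}

/* Hypothetical memory-bound counterpart: 1 flop per 8 bytes loaded. */
double streaming_sum(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}
```

Timing both for the same flop count, with an array much larger than the last-level cache, is one way to see the gap. The actual numbers depend on the CPU, the compiler flags, and the memory system, and a compiler may simplify the first loop under aggressive floating-point options.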

~~~
montecarl
It's worth noting that ATLAS is actually pretty terrible compared to
hand-written BLAS.

Figures 2 and 3 in this paper
([http://cran.r-project.org/web/packages/gcbd/vignettes/gcbd.p...](http://cran.r-project.org/web/packages/gcbd/vignettes/gcbd.pdf))
show that ATLAS is only better than an unoptimized ("reference") BLAS
implementation and that it is many times slower than the winner, GotoBLAS (now
OpenBLAS <http://xianyi.github.com/OpenBLAS/>), which was written in assembly
by Kazushige Goto while he worked at the Texas Advanced Computing Center.

Edit: Article in the NY Times about Mr. Goto
[http://www.nytimes.com/2005/11/28/technology/28super.html?sc...](http://www.nytimes.com/2005/11/28/technology/28super.html?scp=1&sq=Kazushige%20Goto&st=cse&_r=0)

------
raverbashing
This is a very interesting exercise.

CPUs are very capable today; unfortunately, getting 'every last drop of
performance' out of them is very difficult.

I'd say that CPUs have become too complex for compilers to fully optimize
code for them.

Of course, manually optimizing for something like this is "easy" (if you have
a somewhat deep knowledge of Intel's manuals)

Now for "everyday computing" this gets tougher, even though compilers do a
good job (good, not great) and it's usually good enough.

So you end up going to the deeper details in time-sensitive things: games,
video processing, etc

There are some tools (from Intel and AMD, though I only tested the AMD one
some time ago) that tell you 'everything' you need to know about how good
your code is. For example, IIRC, if you load a register then immediately
store to it (or store then load, I don't remember), there's a stall, so you
can do something else in between.

------
TallboyOne
I always get a mild wave of depression after reading stuff like this, because
of just how much I do not know.

------
martinced
I'm a bit surprised by the top-voted answer, from a user who has 80K rep on SO.

By this part:

 _"If you decide to compile and run this, pay attention to your CPU
temperatures!!! Make sure you don't overheat it. And make sure CPU-throttling
doesn't affect your results!"_

Fair enough for the CPU throttling.

But did the nineties just call? Computers dying due to CPU overheating was all
scary and spooky back when we were running old AMDs and old Pentium CPUs, but I
haven't owned any CPU that could die due to overheating in a very, very long
time. They automagically throttle to: a) make sure they stay within their
TDP specs, and b) make sure they don't melt.

I mean, honestly: should scientists really start worrying that their CPUs are
going to overheat and melt?

~~~
askimto
I'm a bit surprised by your smug attitude.

~~~
4ad
So being correct, raising an interesting issue and discarding urban myths is
smug now?

