
Test results for Knights Landing - ingve
http://agner.org/optimize/blog/read.php?i=761
======
gpderetta
Agner says that he sees no point in hyperthreading. Then he also complains
that most AVX512 operations have a latency of 2 clock cycles.

Those two, I think, go hand in hand: the point is not using unused execution
resources with the additional hyperthreads (thus the decode limitation), but
using the other threads to fill the pipeline bubbles created by the higher
latency instructions, à la barrel processor.

edit: this of course works because KNL is aimed at throughput jobs, not
anything latency sensitive.
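
The bubble-filling idea can be sketched in scalar C: a single dependent chain
of adds stalls on each result, while splitting the work into independent
chains lets the next operation issue while the previous one is still in
flight. Hyperthreads do the same thing at the granularity of whole
instruction streams. A minimal sketch (the chain count and the 2-cycle figure
are illustrative):

```c
#include <stddef.h>

/* One dependent chain: each add waits on the previous result, so a
 * 2-cycle add/FMA latency leaves a bubble on every iteration. */
double sum_dependent(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Two independent chains: while one add is in flight, the other can
 * issue, filling the bubble. On KNL, another hyperthread can play the
 * role of the second chain for you. */
double sum_two_chains(const double *x, size_t n) {
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        s0 += x[i];
        s1 += x[i + 1];
    }
    for (; i < n; i++)        /* leftover element when n is odd */
        s0 += x[i];
    return s0 + s1;
}
```

Both functions compute the same sum (up to floating-point reassociation);
only the amount of independent work in flight differs.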

~~~
jandrewrogers
This is correct: by design, it is not possible to get peak IPC without using
all the hyperthreads. It is not a CPU even though it uses a CPU's ISA, so
treating it as a CPU for optimization purposes is a mistake.

The advantage of its microarchitecture is that if you use the hyperthreads
correctly, it will run a broad range of code at close to the theoretical IPC
of the silicon without too much code effort (if you understand the model).
This is in contrast to CPUs, which rarely get close to their theoretical IPC
no matter what you do, or GPUs, which can only run a very narrow range of
codes at close to theoretical IPC. In principle it is extremely efficient in
terms of operation throughput, and it isn't sensitive to what those operations
are, but it requires a large number of independent operations to be in flight
to do its magic, hence all the hyperthreads.

While the architecture explicitly uses latency-hiding to get throughput, it is
mostly about latency-hiding at the sub-microsecond level. As a practical
matter, it shouldn't affect the perceived latency of most real-world software.

~~~
gnufx
I've no idea in what way KNL isn't a CPU, or how optimizing for it is
fundamentally different, going by experience of running one in an HPC setting.
For vectorized floating-point code of the sort we're particularly interested
in these things for (e.g. DGEMM), you get peak performance from a single
thread/core.
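
For scale, the theoretical peak is easy to work out from the published specs.
A sketch of the arithmetic (the SKU figures used below, 68 cores at 1.4 GHz
as on the Xeon Phi 7250, are assumptions; other SKUs differ):

```c
/* Theoretical peak double-precision throughput per KNL core:
 * 2 AVX-512 VPUs x 2 flops per FMA x 8 DP lanes x clock (GHz).
 * Assumes FMA-heavy code that keeps both VPUs busy every cycle. */
double knl_peak_gflops_per_core(double ghz) {
    const double vpus = 2.0, flops_per_fma = 2.0, dp_lanes = 8.0;
    return vpus * flops_per_fma * dp_lanes * ghz;
}
```

At 1.4 GHz that works out to 44.8 GFLOP/s per core, or roughly 3 TFLOP/s
across 68 cores, which is the number DGEMM results get measured against.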

The design is supposed to be "balanced", and it does appear to do a reasonable
job of that for the sort of code that uses the bulk of the time on our system.
In some cases, Broadwell will do substantially better, of course. I don't have
URLs for performance examples to hand.

People mostly seem to be ignoring the potentially important additions in KNL
-- the memory system and the built-in interconnect (though I don't know if the
latter is available yet). Also the large core count should help by keeping
more MPI communication local. I don't know of any relevant results for multi-
node jobs, but 64 cores covers a fair number of the HPC jobs I see.

------
nkurz
In case anyone with pull at Intel is reading, let me say this:

It strikes me as insane that you are not making it easier for Agner to get
early access to your hardware. He's writing fabulous manuals for free that
fill the gaps in the official sources. His work helps more programmers to get
the most out of your hardware than just about anything else available. You
should be throwing money at him to keep him doing this, or at least showering
him with offers of free pre-release hardware. But instead you are depending on his
generosity and leaving it to chance whether he eventually finds a machine to
test on.

Why?

~~~
imode
I'm going to assume it's because Intel doesn't value the community
surrounding their products very much; at least, not the people who build on
their work in this way.

A lot of companies tend towards that behavior. A lot of individual developers
tend towards that behavior. I don't like it because it represents a kind of
destructive egotism.

Hopefully that's not what it is, and it's just Intel not thinking too hard.

~~~
dbcurtis
Well, it's been a few years since I worked at Intel, but I'm guessing it's
just a misalignment of incentives. It's not that some people inside Intel
don't realize who their friends are, it's just that those who do realize who
their friends are do not get measured on treating their friends well. Nor does
their boss, nor their boss's boss. Until their boss's boss _does_ get measured
on how well they treat Intel's friends. Then the friends get smothered in a
bunch of distracting attention.

You get what you measure for. Intel measures a lot. Intel fails to measure
some important things.

The same is probably true of any company with hyper-focused management. Maybe
even your startup.

------
gonewest
Seems to me Intel is placing their bet on the compiler doing the lion's share
of the work here -- and where hand optimization is needed, another bet that
having a consistent instruction set across their big Xeon cores and these MIC
devices will make things easier for developers?

~~~
gpderetta
Actually, the whole point of masking and scatter-gather in AVX512 is to
simplify the job of the compiler, by pretty much allowing any loop to be
vectorized relatively trivially.

It doesn't really require any new compiler breakthrough, as it is pretty much
what all the GPU compilers (e.g. CUDA) have been doing for a while already.
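
The per-lane predication that masking buys can be shown in scalar C. A loop
with a data-dependent branch is the classic obstacle to vectorization; the
masked form evaluates the condition for every element and blends under it,
which is what an AVX512 compiler does sixteen floats at a time with a
compare-into-mask and a masked store. A minimal sketch (plain scalar code so
the idea is visible without intrinsics; the function names are made up):

```c
#include <stddef.h>

/* Data-dependent branch in the loop body: hard to vectorize directly,
 * since different lanes want to take different paths. */
void copy_positive_branchy(float *y, const float *x, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (x[i] > 0.0f)
            y[i] = x[i];
}

/* Predicated form: compute a mask bit for every element, then blend.
 * An AVX512 compiler does this 16 lanes at a time: a compare fills a
 * k-mask register, and a masked store writes only the lanes whose
 * mask bit is set. */
void copy_positive_masked(float *y, const float *x, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int mask = x[i] > 0.0f;      /* one bit of the k-mask */
        y[i] = mask ? x[i] : y[i];   /* masked blend/store */
    }
}
```

Because the masked version has no branch, every iteration is identical and
the loop vectorizes trivially; the same trick extends to gather and scatter
for indexed accesses.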

~~~
greglindahl
Not to mention scatter-gather and masking in Cray's vector processors. You do
still need dependency understanding in the compiler, but that's pretty easy
now compared to the late 1970s.

