
Intel discloses “vector+SIMD” instructions for future processors - gpderetta
http://sites.utexas.edu/jdm4372/2016/11/05/intel-discloses-vectorsimd-instructions-for-future-processors/
======
hughw
Funny: "It is not clear that any compiler will ever use this instruction — it
looks like it is designed for Kazushige Goto's personal use."
[https://en.wikipedia.org/wiki/Kazushige_Goto](https://en.wikipedia.org/wiki/Kazushige_Goto)

~~~
derf_
This sounds like exactly the instruction needed for the inner loop of
xcorr_kernel(), the function at the heart of a bunch of algorithms used in the
Opus codec. This falls under the "convolution kernel" use-case described in
the article.
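
For the curious, the shape of that inner loop is roughly the following (a simplified sketch in the spirit of xcorr_kernel(), not the actual Opus source):

    // Computes four cross-correlations at once so every load of x[j]
    // is reused four times -- exactly the pattern a fused
    // "vector+SIMD" multiply-accumulate is meant to feed.
    void xcorr4(const float *x, const float *y, float *sum, int len) {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int j = 0; j < len; ++j) {
            s0 += x[j] * y[j];
            s1 += x[j] * y[j + 1];
            s2 += x[j] * y[j + 2];
            s3 += x[j] * y[j + 3];
        }
        sum[0] = s0; sum[1] = s1; sum[2] = s2; sum[3] = s3;
    }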

------
paulsutter
This instruction is surely targeted at deep learning applications.
Convolutional layers take up the majority of the compute time of deep networks.

People seem optimistic that compilers will auto-generate such instructions,
but even if a compiler could generate the instruction, you would need to
carefully organize your data structures to take advantage of it.

Pipelining is as important as SIMD for achieving peak flops on current
processors. A single fmadd instruction can do 16 flops and has a latency of
about five cycles, but ten consecutive independent fmadds also complete in
roughly five cycles while performing 160 flops. Getting a pipeline going like
that requires very careful design of data structures by the programmer.
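
To make that concrete, here is a minimal sketch of the multi-accumulator pattern (assumes AVX2+FMA and n a multiple of 32; dot() is an illustrative name, not a real library function):

    #include <immintrin.h>
    #include <cstddef>
    
    // Four independent accumulators keep the fmadd pipeline full
    // instead of serializing every iteration on one register's
    // ~5-cycle latency.
    float dot(const float *a, const float *b, std::size_t n) {
        __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
        __m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();
        for (std::size_t i = 0; i < n; i += 32) {
            acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i),    _mm256_loadu_ps(b+i),    acc0);
            acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+8),  _mm256_loadu_ps(b+i+8),  acc1);
            acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+16), _mm256_loadu_ps(b+i+16), acc2);
            acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+24), _mm256_loadu_ps(b+i+24), acc3);
        }
        // Combine partial sums and reduce to a scalar only at the end.
        __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                                   _mm256_add_ps(acc2, acc3));
        float t[8];
        _mm256_storeu_ps(t, acc);
        return t[0]+t[1]+t[2]+t[3]+t[4]+t[5]+t[6]+t[7];
    }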

Does anyone know if AVX512 will support fp16?

~~~
stuntprogrammer
The currently announced AVX512 extensions do not support fp16, so Skylake
Server (SKX) and Knights Landing (KNL) are at a disadvantage here. Intel has
not publicly said anything about extensions in Knights Hill (the long-announced
successor to KNL).

That said, Intel has announced the emergency "Knights Mill" processor, jammed
into the roadmap between KNL and Knights Hill. It's specifically targeted at
deep learning workloads, and one might expect FP16 support. They had a bullet
point suggesting 'variable' precision too. I would guess that means
Williamson-style variable fixed point. (I also guess that the Nervana
"flexpoint" is a trademarked variant of it.)

I assume the FPGA inference card supports fp16. And Lake Crest (the first
Nervana chip, sampling next year) will support flexpoint, of course. I would
expect subsequent Xeon / Lake Crest successor integrations to do the same.

Fun times.

Aside on the compiler work -- I think it's not that hard to emit this
instruction, at least for GEMM-style kernels where it's relatively obvious.

~~~
paulsutter
Yes, a compiler can generate the instruction. But if it's alone in a for loop
surrounded by random STL classes which, even if inlined, are bodging up the
pipeline or (gasp) causing spurious random DRAM accesses, there's little
performance gain. And that's what usually happens in C++ code that wasn't
already designed for AVX ("it's using AVX, but it's not running any faster. I
guess AVX doesn't make much difference").

Net-net, data and code need to be structured for AVX to achieve the potential
performance gains, and that's 80% of the work.

Once you structure the data and code for AVX, yes, you can use regular C
statements, then experiment with optimization flags until the compiler
generates the intended instructions (and hasn't introduced excessive register
spills). But it's hard to see how that's any easier than using the intrinsics.
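
For example, the kind of loop a compiler can reliably vectorize from plain C statements looks like this (a sketch; __restrict is the common compiler extension, and saxpy is just an illustrative name):

    #include <cstddef>
    
    // Unit stride, no aliasing, no early exits: with -O3 and a
    // suitable -march flag, gcc/clang/icc will typically turn this
    // into AVX fmadds on their own.
    void saxpy(float *__restrict y, const float *__restrict x,
               float a, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }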

~~~
stuntprogrammer
The problem is less the spurious DRAM accesses etc., as awful as they would be.
The compiler problem is really a mix of 1) understanding enough about
fixed-bound, unit-stride loops over non-overlapping memory (or transforming
accesses into that form) and 2) data layouts that prevent that. E.g. while
there are well-understood data layouts at each point of the compilation
pipeline, it's hard in general for compilers to profitably shift from
array-of-structs to struct-of-arrays layouts.
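
Concretely, the layout shift that is hard to automate (field names are made up for illustration):

    // Array-of-structs: one particle's fields sit together, so the
    // x values are strided in memory and loading eight of them into
    // a SIMD register needs a gather.
    struct ParticleAoS { float x, y, z, w; };
    
    // Struct-of-arrays: each field is contiguous, so x[i..i+7] is a
    // single unit-stride vector load.
    struct ParticlesSoA {
        float x[1024], y[1024], z[1024], w[1024];
    };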

You are correct that, generally speaking, most STL-heavy code would be hard to
vectorize and unlikely to gain much advantage. (Plus there are the valarray
misadventures.) You will sometimes see clang and gcc vectorize std::vector if
the code is simple enough and they can assume strict aliasing. Intel's
compiler has historically been less aggressive about assuming strict aliasing.

Various proposals are working through the standards committee to add explicit
support for SIMD programming. E.g. if something like [http://www.open-
std.org/jtc1/sc22/wg21/docs/papers/2014/n418...](http://www.open-
std.org/jtc1/sc22/wg21/docs/papers/2014/n4184.pdf) were to be standardized, we
could write matrix multiply explicitly as:

    
    
      // Sketch assumes row-major n-by-n arrays, with n a multiple of
      // SomeVec::size().
      using SomeVec = Vector<T>;
      for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; j += SomeVec::size()) {
          // c_ij accumulates one SIMD-width slice of row i of C.
          SomeVec c_ij = A[i][0] * SomeVec(&B[0][j], Aligned);
          for (size_t k = 1; k < n; ++k) {
            c_ij += A[i][k] * SomeVec(&B[k][j], Aligned);
          }
          c_ij.store(&C[i][j], Aligned);
        }
      }
    

For my own work on vector languages and compilers, I've had an easier time of
it, since they were designed to enable simpler SIMD code generation.

------
hackcrafter
Man, when I see this stuff I sure hope auto-vectorization matures at the
compiler level in clang etc.

Even more useful would be compiler-level feedback on how to stay within the
constraints needed to auto-vectorize your C/C++ for loop ("I need to make this
data access const", etc.).
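
Clang does expose at least a partial version of this through its optimization remarks. A minimal sketch (the -Rpass flags are real clang options; the loop is illustrative):

    // Compile with:
    //   clang++ -O2 -Rpass=loop-vectorize \
    //           -Rpass-missed=loop-vectorize \
    //           -Rpass-analysis=loop-vectorize scale.cpp
    // The remarks report which loops were vectorized and, for the
    // misses, what blocked them (possible aliasing, non-unit
    // strides, ...). Exact output varies by version.
    void scale(float *dst, const float *src, int n) {
        for (int i = 0; i < n; ++i)
            dst[i] = 2.0f * src[i];
    }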

As far as I know, the Intel compiler is ahead of MSVC/clang on this front
without resorting to OpenMP or other annotations in your code.

~~~
jcranmer
Intel is abandoning their compiler infrastructure and moving everything to
Clang/LLVM. This does mean pushing their autovectorization work into LLVM,
although judging from the quality of conversation in the vectorization BoF at
the latest developers' meeting, it's not clear how much work they're willing
to put into actually producing acceptable upstreamable patches.

~~~
robinhoodexe
It sure would be nice if MKL could be integrated into clang. It's still the
fastest LAPACK implementation. I use it on a relatively powerful cluster
(~11k cores) at university for doing quantum chemical calculations in Dalton,
and having more research software as open source would benefit everyone in the
end, I believe.

~~~
hackcrafter
Hear, hear!

I run into this often. It is amazing the speedup MKL provides for
linear-algebra-heavy scientific computations, and it is under this weirdly
commercial but permissively redistributable framework, which means a lot of
folks are using it unknowingly in a gray licensing area.

See the pre-built Python NumPy/SciPy packages that use it and are often used by
data science types:

[http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy](http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy)

~~~
infinite8s
Not sure about Chris's licensing of MKL, but Continuum has a redistributable
license of MKL packaged into their numpy builds for Anaconda -
[https://www.continuum.io/blog/developer-
blog/anaconda-25-rel...](https://www.continuum.io/blog/developer-
blog/anaconda-25-release-now-mkl-optimizations)

------
bedros
If you're interested in vector-SIMD algorithms, check out this paper:

Linear-time Matrix Transpose Algorithms Using Vector Register File With
Diagonal Registers

[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.6964&rep=rep1&type=pdf)

disclaimer: I'm the author.

------
m_mueller
After 3 years of Xeon Phi, I'm still waiting for OpenCL on Fortran with vector
support so we can finally have a sane and performant programming model for
these things. Instructions are neat, but the tooling support is just not there
for widespread use, IMO. If Intel had taken a more long-term-oriented strategy
with OpenMP years ago, i.e. embracing accelerators of all kinds instead of
trying to hold on tight to it for market protection, I think they'd be in a
better position now.

------
mtgx
Great. Will Intel still disable those features on lower-end chips, thus
ensuring that the market share needed for such features to make sense for
developers won't be reached anytime soon after release?

~~~
gpderetta
This is a KNL-only feature to work around a microarchitectural limitation (at
most two instructions issued per clock) to speed up a few specific
benchmarks^Wworkloads.

~~~
nkurz
1) Saying that it's for benchmarks only seems a little harsh, since once it's
in a few BLAS-like libraries it will probably be widely used.

2) I'm pretty sure this won't be in KNL. It's a proposed future extension, and
I don't think there has been indication of where or when it might land.

~~~
gpderetta
You are right on both counts.

From the article, it will supposedly end up in some future Knights variant (as
mainstream Xeons have less of a need for the "hack").

------
WhitneyLand
>“vector” instructions (multiple consecutive operations)

I thought "vector" in the context of CPU instructions just meant more than
one.

Is there a definition where vector implies consecutive?

~~~
gpderetta
Historically, vector processors had vector registers but scalar ALU execution
units (although possibly more than one). Vector instructions were "just" a way
to make sure that the ALU was fed a new operation every cycle without
instruction-fetch and loop overhead. They also made it easier to pipeline
reads from main memory (main memory latency wasn't so high at that time, so
the large vector operations made it possible to overlap reads with processing
without stalling the CPU). None of those issues has been a bottleneck for a
while, and the memory subsystem of a modern computer is significantly
different, so classic vector processors have fallen out of favour.

In contrast, more modern SIMD machines normally have vector execution units as
wide as the registers themselves, and the advantage, in addition to one N-wide
vector ALU being more power- and area-efficient than N scalar ones, is that,
in an OoO machine, fewer in-flight instructions need to be tracked. It is also
easier to take advantage of wider memory/cache buses.

Because classic vector machines processed elements one at a time, it was
possible to have efficient accumulating operations, which are significantly
harder on proper SIMD processors (so-called horizontal operations).
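
To illustrate the difference, summing the lanes of a single AVX register takes a log2(width) tree of shuffles and adds (a sketch using AVX/SSE intrinsics; hsum is an illustrative name):

    #include <immintrin.h>
    
    // Horizontal sum of 8 floats: three shuffle+add steps, versus a
    // running accumulation that a classic vector machine got for free.
    float hsum(__m256 v) {
        __m128 lo = _mm256_castps256_ps128(v);       // lanes 0..3
        __m128 hi = _mm256_extractf128_ps(v, 1);     // lanes 4..7
        __m128 s  = _mm_add_ps(lo, hi);              // 4 partial sums
        s = _mm_add_ps(s, _mm_movehl_ps(s, s));      // 2 partial sums
        s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));  // final sum in lane 0
        return _mm_cvtss_f32(s);
    }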

------
faragon
In addition to vector SIMD, I would love to see Intel add some VLIW capability
to x86 (mixed in the way SIMD is mixed with x86, i.e. not another "Itanium").

