
Intel will add deep-learning instructions to its processors - ingve
http://lemire.me/blog/2016/10/14/intel-will-add-deep-learning-instructions-to-its-processors/
======
antirez
That's good news! On that topic, two days ago I modified the neural network
implementation inside Neural Redis to use AVX2. It was a pretty interesting
experience, and in the end, after modifying most of it, the implementation is
2x faster than the vanilla C implementation (which was already optimized to be
cache oblivious).

I had never touched AVX or SSE before, so this was a great learning
experience. In 30 minutes you can get 90% of it, but I think that to really do
great stuff you also need to understand the relative cost of every AVX
operation. There is an incredible user on Stack Overflow who replies to most
AVX / SSE questions; if you check the AVX / SSE tag you'll find them easily.

However, I noticed that when there were many load/store operations to do, there
was no particular gain. See for example this code:

    
    
        #ifdef USE_AVX
                __m256 es = _mm256_set1_ps(error_signal);
    
                int psteps = prevunits/8;
                for (int x = 0; x < psteps; x++) {
                    __m256 outputs = _mm256_loadu_ps(o);
                    __m256 gradients = _mm256_mul_ps(es,outputs);
                    _mm256_storeu_ps(g,gradients);
                    o += 8;
                    g += 8;
                }
                k += 8*psteps;
        #endif
    

What I do here is calculate the gradient after having computed the error signal
(error * derivative of the activation function). The code is equivalent to:

    
    
        for (; k < prevunits; k++) *g++ = error_signal*(*o++);
    

Any hints about exploiting AVX to the max in this use case? Thanks. OK,
probably this was more of a thing for Stack Overflow, but too late, hitting enter.

~~~
gok
Assuming this is your code: [https://github.com/antirez/neural-
redis/blob/master/nn.c](https://github.com/antirez/neural-
redis/blob/master/nn.c)

You're really just implementing a vector*matrix multiply. You probably want to
just use BLAS's sgemv routine. On macOS, link Accelerate and use
cblas_sgemv(); on Linux consider installing Intel MKL or OpenBLAS.
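
For illustration, a minimal sketch of what that call could look like for a layer forward pass, assuming row-major weights and hypothetical names (W, x, y and layer_forward are made up, not the actual nn.c variables):

      #include <cblas.h>

      /* y = 1.0*W*x + 1.0*y  (so y should already hold the biases).
         W is an (outputs x inputs) row-major float matrix. */
      void layer_forward(const float *W, const float *x, float *y,
                         int outputs, int inputs) {
          cblas_sgemv(CblasRowMajor, CblasNoTrans,
                      outputs, inputs,
                      1.0f, W, inputs,   /* lda = row stride    */
                      x, 1,              /* incX                */
                      1.0f, y, 1);       /* beta=1 keeps bias   */
      }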

If you're just looking to learn what a reasonably state-of-the-art SGEMV
kernel looks like for a modern chip like Haswell, check out OpenBLAS's code:

[https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_6...](https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_64/sgemv_n_microk_haswell-4.c)

~~~
antirez
That sounds like great advice, I'll check out this library. I have the feeling
that the non-aligned addresses in the current weights scheme will be a problem
and padding will be required. That's why I used AVX "loadu", which deals with
non-aligned addresses, but perhaps I'm paying a big performance penalty
because of this. Thanks.

EDIT: apparently on modern CPUs that's not the case, and magically LOADU can be
as fast as LOAD.

~~~
gens
Don't do unaligned memory access, whatever your cpu flags say.

Another thing you can do, if you don't plan on using the results immediately,
is to use non-temporal stores (movntps for SSE2). If you do plan to use the
results right away, then just use them without storing them to main memory.

And you can do the usual loop unrolling.
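
For example, a rough sketch of how the gradient loop above might combine unrolling with streaming stores, assuming g is 32-byte aligned and prevunits is a multiple of 16 (illustrative fragment only, not the actual nn.c code; needs <immintrin.h>):

      __m256 es = _mm256_set1_ps(error_signal);
      for (int x = 0; x < prevunits; x += 16) {
          __m256 o0 = _mm256_loadu_ps(o + x);
          __m256 o1 = _mm256_loadu_ps(o + x + 8);
          /* non-temporal (movntps-style) stores bypass the cache */
          _mm256_stream_ps(g + x,     _mm256_mul_ps(es, o0));
          _mm256_stream_ps(g + x + 8, _mm256_mul_ps(es, o1));
      }
      _mm_sfence();   /* make the streaming stores visible before reuse */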

~~~
nkurz
_Don't do unaligned memory access, whatever your cpu flags say._

I don't think this is good advice. Do you have an example of poor performance
of unaligned memory access on a processor that supports AVX?

I think the better advice is:

1) Prefer aligned access if possible, favoring writes over reads (see the
allocation sketch after this list).

2) A single unaligned access per cycle usually does not cause a slowdown.

3) Accesses that cross 64B cachelines essentially act as two accesses, thus
may add a cycle.

4) You can only sustain multiple accesses per cycle if you avoid crossing
cachelines.

5) Writes that cross 4KB pages have a 100+ cycle penalty on pre-Skylake Intel
(down to 5 cycles on Skylake).
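
As a concrete illustration of point 1, a minimal sketch of getting aligned storage in the first place (hypothetical names; nweights stands in for whatever size the network needs):

      #include <immintrin.h>

      /* 32-byte alignment lets AVX use the aligned load/store forms. */
      float *weights = _mm_malloc(nweights * sizeof(float), 32);
      __m256 w = _mm256_load_ps(weights);   /* would fault if unaligned */
      _mm_free(weights);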

~~~
gens
Does anybody have an example of good performance of unaligned memory access on
modern cpus? And note that it doesn't matter whether the cpu supports AVX, but
whether it has a flag that says it can do fast unaligned memory access (i don't
remember, is it misalignsse?).

Common sense says that unaligned access can't be faster than aligned. And if
you have data that fits into a ymm register, then you might as well use
aligned access (a neural network is usually an example of such data).

I did test it a while ago. Problem is that i don't remember if it was on this
modern cpu or the older one. I could test again if i cared enough about other
people's opinions, but alas i don't (the only usage of unaligned AVX access i
found was from newbies to SIMD). An example, as you request, would be to look
at glibc memcpy, which uses ssse3 [0] so that it can always get aligned access
(ssse3 has per-byte operations).

In other words, how about the people who claim that operations that do extra
work are as fast as the ones that don't prove it? Instead of the burden of
proof falling on people that don't share that opinion/experience? Then i will
bow my head and say "You are right. Thank you for pointing that out". But
alas, 10 minutes of googling found no such benchmark anywhere. And writing
such a test isn't hard, not in the slightest.

[0][https://github.com/lattera/glibc/blob/master/sysdeps/x86_64/...](https://github.com/lattera/glibc/blob/master/sysdeps/x86_64/multiarch/memcpy-
ssse3.S)

~~~
nkurz
_In other words, how about the people who claim that operations that do extra
work are as fast as the ones that don't prove it? Instead of the burden of
proof falling on people that don't share that opinion/experience? Then i will
bow my head and say "You are right. Thank you for pointing that out". But
alas, 10 minutes of googling found no such benchmark anywhere. And writing
such a test isn't hard, not in the slightest._

I tend to the opposite view: those saying "do not do X" are in fact obligated
to explain why X should be avoided. But perhaps this is just a difference in
worldview.

One good reference for the "alignment doesn't matter" view is an earlier post
on the same blog that is being discussed here:
[http://lemire.me/blog/2012/05/31/data-alignment-for-speed-
my...](http://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-
reality/)

I linked elsewhere in the thread to my more detailed experiments regarding
unaligned vector access on Haswell and Skylake:
[http://www.agner.org/optimize/blog/read.php?i=415#423](http://www.agner.org/optimize/blog/read.php?i=415#423).
This is the source of my conclusion that alignment is not a significant factor
when reading from L3 or memory, but does matter when attempting multiple reads
per cycle from L1.

Both of these link to code that can be run for further tests. If you find an
example of an unaligned access that is significantly slower than an aligned
one on a recent processor (and they certainly may exist), I'll nudge Daniel
into writing an update to his blog post.

~~~
gens
>I tend to the opposite view: those saying "do not do X" are in fact obligated
to explain why X should be avoided. But perhaps this is just a difference in
worldview.

For me it depends on the context. Here aligned access makes more sense, so it
is unaligned access that needs defending.

I hacked together a test, feel free to point out mistakes.

c part: [http://pastebin.com/zMha8Fre](http://pastebin.com/zMha8Fre)

asm part(SSE and SSE2 for nt):
[http://pastebin.com/mxEFC8Cw](http://pastebin.com/mxEFC8Cw)

results:

aligned: 0 sec, 69049070 nsec

unaligned on aligned data: 0 sec, 69210069 nsec

unaligned on one byte unaligned data: 0 sec, 70278354 nsec

unaligned on three bytes unaligned data: 0 sec, 70315162 nsec

aligned nontemporal: 0 sec, 42549571 nsec

naive: 0 sec, 67741031 nsec

Repeating the test only shows non-temporal to be of benefit. The difference
of, on average, 1-2% is not much, that i yield. But it is measurable.

But that is not all! Changing the copy size to something that fits in the
cache (1MB) showed completely different results.

aligned: 0 sec, 160536 nsec

unaligned on aligned data: 0 sec, 179999 nsec

unaligned on one byte unaligned data: 0 sec, 375108 nsec

aligned nontemporal: 0 sec, 374811 nsec // usually a bit slower than one byte
unaligned

And, out of interest, i made all the copies skip every second 16 bytes; the
(relative) results are the same as the original test, except non-temporal
being over 3x slower than anything else.

And this is on an amd fx8320 that has the misalignsse flag. On my former cpu
(can't remember if it was the celeron or the amd 3800+) the results were
_very_ much in favor of aligned access.

So yea, align things. It's not hard to just add "__attribute__ ((aligned
(16)))" (for gcc; idk about other compilers).
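
For instance (a trivial sketch of the gcc attribute; the array name and size are just placeholders):

      /* 16-byte-aligned buffer, safe for aligned SSE loads/stores */
      static float weights[1024] __attribute__ ((aligned (16)));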

PS It may seem like the naive way is good, but memcpy is a bit more
complicated than that.

~~~
qb45
See what happens when you change HALF_OF_BUFFER_SIZE from 1M to 1M+64. Or 128
or 1024. I think what you observed is the result of loads and stores hitting
the same cache set at the same time, all while misalignment additionally
increases the number of cache banks involved in any given operation. But
that's just hand-waving; I don't know the internals well enough to say with
confidence what's going on exactly.

BTW, changing misalignment from 1 to 8 reduces this effect by half on my
Thuban. Which is important, because nobody sane would misalign an array of
doubles by 1 byte, while processing part of an array starting somewhere in the
middle is a real thing.

Also, your assembly isn't really that great. In particular, LOOP is microcoded
and sucks on AMD. I got better results with this:

    
    
      typedef float sse_a __attribute__ ((vector_size(16), aligned(16)));
      typedef float sse_u __attribute__ ((vector_size(16), aligned(1)));
      
      void c_is_faster_than_asm_a(sse_a *dst, sse_a *src, int count) {
              for (int i = 0; i < count/sizeof(sse_a); i += 8) {
                      dst[i] = src[i+0];
                      dst[i] = src[i+1];
                      dst[i] = src[i+2];
                      dst[i] = src[i+3];
                      dst[i] = src[i+4];
                      dst[i] = src[i+5];
                      dst[i] = src[i+6];
                      dst[i] = src[i+7];
              }
      }
      void c_is_faster_than_asm_u(sse_u *dst, sse_u *src, int count) {
              // ditto

~~~
gens
>See what happens when you change HALF_OF_BUFFER_SIZE from 1M to 1M+64. Or 128
or 1024.

Tested. There's a greater difference between aligned and aligned_unaligned.
But that made the test go over my cache size (2MB per core), so i tested with
512kB with and without your +128. Results were (relatively) similar to the
original 1MB test.

>Which is important, because nobody sane would misalign an array of doubles by
1 byte [...]

Adobe flash would, for starters (idk if doubles, but it calls unaligned memcpy
all the time). The code from the person above also does, because compilers
sometimes do (an aligned mov sometimes segfaults if you don't tell the
compiler to align an array, especially if it's in a struct).

>Also, your assembly isn't really that great. In particular, LOOP is
microcoded and sucks on AMD. I got better results with this:

Of course you did, you unrolled the loop. The whole point was to test memory
access, not to write a fast copy function.

>c_is_faster_than_asm_a()

First of all, that is not in the C specification. It is a
gcc/clang/idk_if_others extension to C. It compiles to something similar to
what I would write if i had unrolled the loop. Actually worse; here's what it
compiled to: [http://pastebin.com/yL31spR2](http://pastebin.com/yL31spR2).
Note that this is still _a lot_ slower than movntps when going over the cache
size.

edit: I didn't notice at first. Your code copies 8 16-byte... chunks to the
first one. You forgot to add +n to dst.

~~~
qb45
Crap, that was bad. Fixed. And removed the insane unrolling, now 2x is
sufficient.

You are right, 128 is not enough on Piledriver. Still,

    
    
      ./test $(( 512*1024+1024*0 ))
      aligned: 0 sec, 134539 nsec
      unaligned on aligned data: 0 sec, 101471 nsec
      unaligned on one byte unaligned data: 0 sec, 190368 nsec
      unaligned on three bytes unaligned data: 0 sec, 181823 nsec
      aligned nontemporal: 0 sec, 359920 nsec
      naive: 0 sec, 214007 nsec
      c_is_faster_than_asm_a:   0 sec, 92437 nsec
      c_is_faster_than_asm_u:   0 sec, 92643 nsec
      c_is_faster_than_asm_u+1: 0 sec, 156574 nsec
      c_is_faster_than_asm_u+3: 0 sec, 156359 nsec
      c_is_faster_than_asm_u+4: 0 sec, 154932 nsec
      c_is_faster_than_asm_u+8: 0 sec, 155784 nsec
    
      ./test $(( 512*1024+1024*1 ))
      aligned: 0 sec, 107036 nsec
      unaligned on aligned data: 0 sec, 94861 nsec
      unaligned on one byte unaligned data: 0 sec, 114444 nsec
      unaligned on three bytes unaligned data: 0 sec, 115915 nsec
      aligned nontemporal: 0 sec, 407951 nsec
      naive: 0 sec, 219215 nsec
      c_is_faster_than_asm_a:   0 sec, 82474 nsec
      c_is_faster_than_asm_u:   0 sec, 82554 nsec
      c_is_faster_than_asm_u+1: 0 sec, 112544 nsec
      c_is_faster_than_asm_u+3: 0 sec, 115159 nsec
      c_is_faster_than_asm_u+4: 0 sec, 198434 nsec
      c_is_faster_than_asm_u+8: 0 sec, 118952 nsec
    

4k is the stride of L1, your code slows down 1.5x:

    
    
      ./test $(( 512*1024+1024*4 ))
      aligned: 0 sec, 107576 nsec
      unaligned on aligned data: 0 sec, 94010 nsec
      unaligned on one byte unaligned data: 0 sec, 140534 nsec
      unaligned on three bytes unaligned data: 0 sec, 140517 nsec
      aligned nontemporal: 0 sec, 467981 nsec
      naive: 0 sec, 206891 nsec
      c_is_faster_than_asm_a:   0 sec, 85294 nsec
      c_is_faster_than_asm_u:   0 sec, 85174 nsec
      c_is_faster_than_asm_u+1: 0 sec, 118674 nsec
      c_is_faster_than_asm_u+3: 0 sec, 118902 nsec
      c_is_faster_than_asm_u+4: 0 sec, 118370 nsec
      c_is_faster_than_asm_u+8: 0 sec, 118638 nsec
      

128k is the stride of L2, both codes slow down further:

    
    
      ./test $(( 512*1024+1024*128 ))
      aligned: 0 sec, 167906 nsec
      unaligned on aligned data: 0 sec, 140650 nsec
      unaligned on one byte unaligned data: 0 sec, 239271 nsec
      unaligned on three bytes unaligned data: 0 sec, 251342 nsec
      aligned nontemporal: 0 sec, 458850 nsec
      naive: 0 sec, 364731 nsec
      c_is_faster_than_asm_a:   0 sec, 125240 nsec
      c_is_faster_than_asm_u:   0 sec, 118917 nsec
      c_is_faster_than_asm_u+1: 0 sec, 197348 nsec
      c_is_faster_than_asm_u+3: 0 sec, 196755 nsec
      c_is_faster_than_asm_u+4: 0 sec, 199757 nsec
      c_is_faster_than_asm_u+8: 0 sec, 197842 nsec

------
sbierwagen
Blogspam, kinda. Post just links to
[https://software.intel.com/sites/default/files/managed/69/78...](https://software.intel.com/sites/default/files/managed/69/78/319433-025.pdf)
and says it mentions two new instruction families: AVX512_4VNNIW (Vector
instructions for deep learning enhanced word variable precision) and
AVX512_4FMAPS (Vector instructions for deep learning floating-point single
precision).

On Intel's part, it seems kinda... late? It's like adding Bitcoin instructions
when you already know everyone's racing to make Bitcoin ASICs. How could it
beat dedicated hardware, or even GPUs, on ops/watt? Maybe it's intended for
inference, not training, but that doesn't sound compelling either.

~~~
dsabanin
Intel is also adding FPGAs[1] right next to their CPUs, which could possibly
become the next big thing in HPC.

[1] [http://www.extremetech.com/extreme/184828-intel-unveils-
new-...](http://www.extremetech.com/extreme/184828-intel-unveils-new-xeon-
chip-with-integrated-fpga-touts-20x-performance-boost)

~~~
yid
This aspect of their new chips is massively underrated. An FPGA is the future-
proof solution here, not chip-level instructions for the soup-du-jour in
machine learning.

Edit: which is not to say that I'm not welcoming the new instructions with
open arms...

~~~
astrodust
I'm not as hyped about FPGA-in-CPU so much as I am about Intel releasing a
specification for their FPGAs that will allow development of third-party tools
to program them.

Right now the various vendors seem to insist on their own proprietary
_everything_ which makes it hard to streamline your development toolchain.
Many of the tools I've used are inseparably linked to a Windows-only GUI
application.

~~~
ronald_raygun
I'm not too familiar with FPGAs, but isn't the tradeoff that since they are
flexible they are usually much slower than CPUs/GPUs, and aren't they usually
used to prototype an ASIC? How is FPGA-in-CPU going to be a good thing?

~~~
astrodust
They're slower in terms of clock speed, but they're not slower in terms of
results.

You can do things in an FPGA that a CPU can't even touch; it can be configured
to do massively parallel computations, for example.

If Bitcoin is any example, GPU is faster than CPU, FPGA is faster than GPU,
and ASIC is faster than FPGA. Each one is at least an order of magnitude
faster than the previous one.

A GPU can do thousands of calculations in parallel, but an FPGA can do even
more if you have enough gates to support it.

I haven't looked too closely at the SHA256 implementations for Bitcoin, but it
is possible to not only do a lot of calculations in parallel, but also have
them pipelined so each round of your logic is actually executing
simultaneously on different data rather than sequentially.

------
yongjik
Apologies if I'm dumb, but did Intel actually say exactly what operations
these "deep learning" instructions do?

I skimmed through the linked Intel manual, but it seems that it just defines
two instruction family names (AVX512_4VNNIW and AVX512_4FMAPS) without
actually saying what they do.

* It almost feels like a marketing term, in the same way everything was "multimedia instructions" back in the 90s.

~~~
iaw
It's been a long time since I looked at machine language but, if I understand
correctly, the difference is the capacity of the operation. Instead of
performing an addition on 1 or 2 bytes at a time, the operation can be
performed on 64 bytes at a time. Given that 'deep learning' requires a large
number of simple parallel operations, expanding the volume of parallel
operations in a single CPU instruction boosts capacity.

_Seriously, just supposition until someone with better understanding chimes
in._
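
A hedged illustration of that general wide-SIMD idea with AVX-512BW intrinsics (assuming the CPU supports them; this is not one of the new deep-learning extensions, and the function and buffer names are made up):

      #include <immintrin.h>
      #include <stdint.h>

      /* One vector add processes 64 packed bytes per instruction. */
      void add_bytes64(const uint8_t *a, const uint8_t *b, uint8_t *out) {
          __m512i va = _mm512_loadu_si512(a);
          __m512i vb = _mm512_loadu_si512(b);
          _mm512_storeu_si512(out, _mm512_add_epi8(va, vb));
      }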

~~~
bradhe
heh, if this is the case then we're going to benefit in a lot of areas from
this.

------
zbjornson
Really curious how Intel (others?) determined that these new instructions are
for deep learning. The instructions are for permutation, packed
multiply+low/high accumulate, and a sort of masked byte move. Are these super
common in deep learning?

~~~
p1esk
Deep learning is all about GEMM operations: matrix-matrix or matrix-vector
multiplications. And the values in these matrices are typically low precision
(16 or even 8 bits is enough). So if you can pack many such low-precision
values into vectors and perform a dot product in parallel, you get a pretty
much linear speedup.
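
A rough sketch of that idea using existing AVX2 (not the new extensions): vpmaddwd (_mm256_madd_epi16) multiplies 16 pairs of 16-bit values and sums adjacent products into 32-bit lanes, which is most of a low-precision dot product. Names and the overflow handling are simplified for illustration:

      #include <immintrin.h>
      #include <stdint.h>

      /* Illustrative 16-bit dot product; n must be a multiple of 16. */
      int32_t dot_i16(const int16_t *a, const int16_t *b, int n) {
          __m256i acc = _mm256_setzero_si256();
          for (int i = 0; i < n; i += 16) {
              __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
              __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
              /* 16 products, summed pairwise into 8 x 32-bit lanes */
              acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
          }
          int32_t lanes[8];
          _mm256_storeu_si256((__m256i *)lanes, acc);
          return lanes[0]+lanes[1]+lanes[2]+lanes[3]
               + lanes[4]+lanes[5]+lanes[6]+lanes[7];
      }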

------
unsignedqword
Cool stuff, but unfortunately Intel has really delayed AVX512 instructions for
their main consumer processors (ffs, it was supposed to hit on Skylake). It
looks like we have another die shrink to go after Kaby Lake before we get that
sweet ultra-wide SIMD:

[https://en.wikipedia.org/wiki/Cannonlake](https://en.wikipedia.org/wiki/Cannonlake)

~~~
p1esk
Skylake Xeons are still on track to have AVX512, last I checked.

------
pritambaral
Hugged. Google cache link:
[http://webcache.googleusercontent.com/search?q=cache:lemire....](http://webcache.googleusercontent.com/search?q=cache:lemire.me/blog/2016/10/14/intel-
will-add-deep-learning-instructions-to-its-
processors/&ie=utf-8&oe=utf-8&client=firefox-b-
ab&gws_rd=cr&ei=yg0BWIaKIIzVvgSfr4uIDQ)

------
eveningcoffee
I did tests with Torch7 a few days ago to compare the OpenBLAS library on
multiple cores against CUDNN.

The single-core difference was around 100x.

Using multiple cores improved the situation, but there was still about a 15x
difference that did not improve by adding more cores (I tested with up to 64
Xeon cores on a single machine).

I did not test with the Intel MKL library as I did not have time for this.

I wonder how much these instructions would improve the situation, and whether
anybody here has experience with the Intel MKL library.

------
happycube
Not holding my breath, since they haven't gotten AVX512 into desktops yet!
(Kaby Lake might be worth something if they did... but nope, just more lie7's
mostly...)

------
schappim
There is actually an undocumented neural net engine within the Intel chip on
the Arduino 101.

~~~
schappim
I guess it's documented now: [http://www.general-
vision.com/products/curieneurons/](http://www.general-
vision.com/products/curieneurons/)

------
rosstex
Could someone explain how these are optimized for deep learning? All I see is
that they're part of the 512-bit vector family that already existed.

------
gragas
Isn't this basically just wider SIMD? Is this in any way actually tailored
specifically to deep learning?

------
jostmey
How much will this help? Neural networks work so well on distributed computing
systems because the operations run in parallel. With a neural network, you can
leverage tens of thousands of separate computing cores. Top intel processors
have around a dozen cores or so.

What am I missing here?

~~~
steego
> What am I missing here?

There may be a market sweet spot between dedicated high-performance machine
learning, which is ideal for experts willing to buy/lease dedicated hardware,
and regular developers who want to use simpler machine learning toolkits to
solve easier problems. Toolsets like Julia, NumPy, R, etc. would take
advantage of the new instructions and basically give your average developer a
speed boost for simply having an Intel PC.

~~~
eb0la
This reminds me of when you had to buy crypto accelerator cards for your
webservers (does anyone remember Soekris?)...

The market stalled after Intel started adding support for crypto operations
on-die.

This looks like Intel protecting its server chip business by shooting at
Nvidia before they start selling the Titan-X with some other server chip
(which chip, I don't know; but I bet there's a business plan on a spreadsheet
somewhere).

I guess training neural nets will be a GPU / custom chip task for some time;
but as soon as you have developed the classifier / predictor you have to run
it on whatever hardware you have, which means you might need less parallel
computing power, and it doesn't make (financial) sense to buy GPUs for that.

------
partycoder
You can also try a binarized neural network.

[https://arxiv.org/abs/1602.02830](https://arxiv.org/abs/1602.02830)

------
visarga
> Intel is hoping that the Xeon+FPGA package is enticing enough to convince
> enterprises to stick with x86, rather than moving over to a competing
> architecture (such as Nvidia’s Tesla GPGPUs ).

Then let it prove its performance on a few deep learning benchmarks. If it is
so fast and accessible, we'll be impressed.

------
betolink
Is this the moment when Sarah Connor shows up to ruin the party?

------
thasaleni
Excuse my ignorance, but isn't "deep learning instructions" an oxymoron?

------
joelthelion
So Intel CPUs will be able to perform 16 operations at a time, when Nvidia can
do thousands?

~~~
sitkack
Latency

~~~
j1vms
> Latency

This. Also, trying to slow the loss of customers purchasing what are
essentially Nvidia GPU-heavy systems with Nvidia CPUs tacked on to act as the
controllers.

------
ktamiola
Sounds good to me. Let's hope this is not just a marketing move.

------
AvenueIngres
What are the implications of this? Faster processing?

~~~
gragas
TL;DR: You can basically do 8 operations at once (but of course, it's not that
simple).

Think about a normal 64-bit register: you can add and subtract 64-bit numbers,
OR and AND them, etc. Now think about a 64-_byte_ register. What could you do
with that? Well, suppose you're looping through contiguous memory and changing
each word, AND the change you're making to each word is independent of the
change you make to any other word. With a normal 64-bit register you'd have to
do one operation on each word. But with a 64-byte "register", you can load 8
words into it and do just one operation for the same effect as applying that
operation 8 times (once to each word). Thus, some code---_vectorizable_ code
---can be sped up nearly 8x.
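
A hedged concrete version of the 8-words-at-a-time idea, using AVX-512 intrinsics (assuming hardware support; the function and array names are made up for illustration):

      #include <immintrin.h>

      /* One instruction adds 8 double-precision words at a time.
         Purely illustrative; n must be a multiple of 8 here. */
      void add_arrays(const double *x, const double *y, double *out, int n) {
          for (int i = 0; i < n; i += 8) {
              __m512d vx = _mm512_loadu_pd(x + i);
              __m512d vy = _mm512_loadu_pd(y + i);
              _mm512_storeu_pd(out + i, _mm512_add_pd(vx, vy));
          }
      }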

------
zenobit256
Great, another extension to the ever-growing x86 instruction set...

I'd question why, but at this point, we're just going to keep bolting things
on to an existing architecture.

"Microcode it. Why not?"

------
tomrod
Well. I guess I'm going to be using Intel!

------
denim_chicken
I hope they will be more useful than MPX.

------
sunstone
"get me a beer' would be an appreciated instruction to add.

