
Intel Compiler Intrinsics Guide - lkurusa
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
======
SoManyIntrinsic
Hey folks, I'm the owner of the Intrinsics Guide at Intel. Let me know if you
have any specific feedback or questions.

~~~
Const-me
Before version 3.3.16 of that documentation, most instructions contained a
small table with architecture, latency and throughput columns.

Since version 3.3.16, including the current 3.4, the guide no longer contains
this info.

Why have you removed that?

~~~
SoManyIntrinsic
That was an oversight. I'll resolve that, and then you should see
latency/throughput data through Skylake.

~~~
Const-me
Awesome, thanks!

------
amelius
Some offtopic notes about this webpage.

Nice visual design.

Searching for "dot product" does not work, but searching for "dot-product"
does work.

Also, every character I type in the search box ends up as a new entry in my
browser history ...

~~~
lkurusa
This is one of the reasons I posted this. The visual design is very eye-
catching and the level of detail on the intrinsics is very nice. It's also
great to see which ones are for SSE{,2,3,S3,4.1} and AVX{,2,512}.

------
Const-me
Offline version:

[https://github.com/Const-me/IntelIntrinsics](https://github.com/Const-me/IntelIntrinsics)

------
danieldk
Dash (and thus also Zeal) has a user-contributed docset of this guide. So, if
you want to search this quickly locally, you can do so with Dash.

------
exikyut
This is arguably tangential, but also arguably relevant:

How does hardware video decoding and encoding work?

Does the fact that modern processors have on-chip GPUs with acceleration mean
the instructions are any different?

(Graphics has long been a "???" of mine, but I don't want to get into
generalizations in this particular thread. Hardware video {de,en}coding seems
mildly relevant though.)

~~~
dooglius
Integrated graphics aren't part of any of the actual processor cores; rather,
the integrated GPU sits alongside the CPU cores the way one core sits alongside
another, sharing the chip's L3 cache. The CPU communicates with the GPU via a
memory-mapped interface, the same way it talks to most devices. The GPU itself
does have its own architecture and instruction set, much of which deals with
specific coding schemes like H.265 rather than the generic sort of
instructions you'd find on a CPU.

------
Narishma
Why are there different names for the same intrinsic?

For example, _mm_add_pi8() seems to be identical to _m_paddb(). Same for
_m_empty() and _mm_empty().

Are there any subtleties I'm missing?

~~~
secure
_m seems to be the prefix for MMX, whereas _mm is the prefix for SSE.

I think the instructions which didn’t need to change just got a new intrinsic
alias so that each instruction set is self-contained, i.e. when working with
SSE, you should only need to look at the SSE docs, not also know the MMX docs
already.

~~~
my123
Also that those don't use the same set of registers, you're forced to do so
anyway

------
exikyut
If you select a checkbox on the left, when you clear it the page scrolls back
to the top. >_<

------
amelius
It would be nice if there was an online sandbox where you could easily play
with the instructions.

~~~
zbjornson
There are several C++ sandboxes online. This one lets you set the compiler
flags that you need for these SIMD ISAs, I think up through AVX2:
[http://www.compileonline.com/compile_cpp_online.php](http://www.compileonline.com/compile_cpp_online.php).

~~~
amelius
Yes that's certainly nice, but it would be even nicer if there was some
example code for every CPU instruction that you could run this way (accessible
using a single click).

------
adraenwan
CISC in a nutshell

~~~
Const-me
Same for RISC:
[https://developer.arm.com/technologies/neon/intrinsics](https://developer.arm.com/technologies/neon/intrinsics)

~~~
mmozeiko
I would not call ARM a RISC. What is RISC in your opinion? Lack of memory
addressing except load and store? Not true for ARMv8.1. Lack of microops? ARM1
(from 1985) has microops.

~~~
Const-me
Reduced set of instructions, and simple encoding of them.

ARM instructions are always 32 bits; this includes NEON. There are signs the
ARM designers were indeed trying to minimize the instruction count (e.g.
there's no right-shift NEON instruction; instead, left shift is used with a
negative shift value).

Thumb instructions can be 16 bits, but that's still far simpler than x86,
where a single instruction can be anywhere between 8 and 120 bits.

~~~
mmozeiko
"Simple encoding" doesn't mean it's RISC. There are instruction sets with
simple encodings that are CISC. For example, the PDP-10 has fixed-size,
simply encoded instructions but is "classically" known as CISC.

CISC vs RISC for modern CPUs (the last 20 or 30 years) doesn't mean anything.
Any modern ARM or Intel CPU is partly RISC and partly CISC at the same time.

Think this way:

1) Modern x86 micro-ops can be viewed as RISC. So is x86 RISC?

2) The ARM1 (from 1985) had micro-ops, which means its user-visible opcodes
were not reduced enough. So is ARM CISC?
[http://www.righto.com/2016/02/reverse-engineering-arm1-processors.html](http://www.righto.com/2016/02/reverse-engineering-arm1-processors.html)
And that's not even talking about modern 64-bit ARM cores.

~~~
Const-me
I don’t think implementation details are relevant for CISC vs RISC. At least
not anymore. I think it’s a characteristic of the CPU’s instruction set: not
the internal undocumented instructions, but the instructions publicly
available to programmers.

Also, ARM says their CPU designs are based on RISC principles:
[https://developer.arm.com/products/architecture/cpu-architecture](https://developer.arm.com/products/architecture/cpu-architecture)

------
skookumchuck
> without the need to write assembly code

You basically are writing assembly with these intrinsics.

~~~
Negative1
Not true -- this is substantially more portable (the GNU compiler, for
instance, will even give you fallback instructions). A number of years ago I
wrote PPU intrinsics (so not Intel, but similar) wrapped in macros that worked
on both the Xbox 360 and PS3 PowerPC processors. In one instance I can
remember (frustum culling in a complex scene graph) I saw a 3ms frame-time
delta. That was a few days of development time well spent.

I'm guessing these intrinsics map 1:1 to some AMD equivalent. Oh, looky here:
[https://msdn.microsoft.com/en-us/library/hh977022.aspx](https://msdn.microsoft.com/en-us/library/hh977022.aspx)

