Intel Compiler Intrinsics Guide (intel.com)
97 points by lkurusa 8 months ago | 38 comments

Hey folks, I'm the owner of the Intrinsics Guide at Intel. Let me know if you have any specific feedback or questions.

Before version 3.3.16 of that documentation, most instructions contained a small table with architecture, latency and throughput columns.

Since version 3.3.16, including the current 3.4, the guide no longer contains this info.

Why have you removed that?

That was an oversight. I'll resolve that, and then you should see latency/throughput data through Skylake.

Awesome, thanks!

I use this all the time, it's a wonderful tool! One feature I would really like is the ability to have multiple text filters, either as part of the query or explicitly. I often find myself searching for a certain intrinsic (or family of intrinsics) that operate on a certain bit width.

Otherwise, this is one of the best cpu documentation tools I've used.

As a side request, it would be amazing if the intel documentation on non-temporal memory operations and sfence+lfence had more specifics about how they interact with the rest of the cache hierarchy and the load/store subsystems.

I use this almost daily, it's great! The only thing I can even think of adding that might help me is the ability to only show intrinsics based on data type, e.g. only show functions that use __m128, etc.

Some offtopic notes about this webpage.

Nice visual design.

Searching for "dot product" does not work, but searching for "dot-product" does work.

Also, every character I type in the search box ends up as a new entry in my browser history ...

This is one of the reasons I posted this. The visual design is very eye-catching and the level of detail on the intrinsics is very nice. It's also great to see which ones are for SSE{,2,3,S3,4.1} and AVX{,2,512}.

Dash (and thus also Zeal) has a user-contributed docset of this guide. So, if you want to search this quickly locally, you can do so with Dash.

This is arguably tangential, but also arguably relevant:

How does hardware video decoding and encoding work?

Does the fact that modern processors have on-chip GPUs with acceleration mean the instructions are any different?

(Graphics has long been a "???" of mine, but I don't want to get into generalizations in this particular thread. Hardware video {de,en}coding seems mildly relevant though.)

Integrated graphics aren't part of any of the actual processors, but rather, the integrated GPU sits alongside the CPUs as one core sits alongside another, and sharing the chip's L3 cache. CPUs communicate with the GPU via a memory-mapped interface, the same way they talk to most devices. The GPU itself does have its own architecture and instruction set, a lot of which are going to deal with specific coding schemes like H.265 rather than the generic sort of instructions you'd find on a CPU.

Hardware decoding has its own module, commonly called "VDEC" in most architectures. It is mostly functionally divorced from the rest of the GPU.

GPUs themselves are mostly SIMD vector processors for their "shader cores", with a bunch of custom fixed-function hardware for the more specialized blocks in the pipeline. This is a completely separate unit from the CPU. There was an attempt from Intel, known as "Larrabee", to build a GPU-style pipeline on top of an expanded Intel CPU. The consumer product was canned, but the expanded CPU went on to become Intel's Xeon Phi line.

There are ASICs on board for that.

Why are there different names for the same intrinsic?

For example, _mm_add_pi8() seems to be identical to _m_paddb(). Same for _m_empty() and _mm_empty().

Are there any subtleties I'm missing?

_m seems to be the prefix for MMX, whereas _mm is the prefix for SSE.

I think the instructions which didn’t need to change just got a new intrinsic alias so that each instruction set is self-contained, i.e. when working with SSE, you should only need to look at the SSE docs, not also know the MMX docs already.

Also, since MMX and SSE don't use the same set of registers, you're forced to keep them distinct anyway.

Wasn't MMX excluded from x86_64?

If you select a checkbox on the left, when you clear it the page scrolls back to the top. >_<

It would be nice if there was an online sandbox where you could easily play with the instructions.

There are several C++ sandboxes online. This one lets you set the compiler flags that you need for these SIMD ISAs, I think up through AVX2: http://www.compileonline.com/compile_cpp_online.php.

Yes that's certainly nice, but it would be even nicer if there was some example code for every CPU instruction that you could run this way (accessible using a single click).

I think this would be doable with QEMU, which AFAIK reasonably keeps up with CPU capabilities.

Hm, it could actually make for a fun hack.

Adds to todo list

I'm currently doing a series on the different ISA extensions to x86 over at dev.to/ and ultimately the idea behind the series was to build something like this by the end.

repl.it supports compiling and running code that uses many intrinsics.

Doesn't a cloud VM do this?

Technically yes, but that's really just the raw buttons and switches.

A version of the godbolt compiler explorer that included execution would be cool.

Would you be willing to pay for this and how much?

> without the need to write assembly code

You basically are writing assembly with these intrinsics.

Not true -- this is substantially more portable (the GNU compiler, for instance, will even give you fallback instructions). A number of years ago I wrote PPU intrinsics (so not Intel, but similar) wrapped in macros that worked on both the Xbox 360 and PS3 PowerPC processors. In one instance I can remember (frustum culling in a complex scene graph) I saw a 3ms frame time delta. That was a few days of development time well spent.

I'm guessing these intrinsics map 1:1 to some AMD equivalent. Oh, looky here: https://msdn.microsoft.com/en-us/library/hh977022.aspx

I get why you are saying that, but with intrinsics and C++ I can write template functions that compile for any data type (float/double) or any instruction set (SSE/AVX), and that can be reused and inlined anywhere in my codebase. I just let the compiler take care of register allocation, and most of the time the compiler (VC++ in this case) generates near-optimal assembly. It's just so far removed from assembly for this use case.

CISC in a nutshell

I would not call ARM a RISC. What is RISC, in your opinion? Lack of memory addressing except load and store? Not true for ARMv8.1. Lack of micro-ops? The ARM1 (from 1985) had micro-ops.

Reduced set of instructions, and simple encoding of them.

ARM instructions are always 32 bits; this includes NEON. There are signs the ARM developers were indeed trying to minimize their count (e.g. there's no right-shift NEON instruction; instead, a left shift with a negative shift value is used).

Thumb instructions can be 16 bits, but this is still far simpler than x86, where a single instruction can be anywhere between 8 and 120 bits.

"Simple encoding" doesn't mean it's RISC. There are instruction sets with simple encodings that are CISC. For example, the PDP-10 has fixed-size, simple encoding but is "classically" considered CISC.

CISC vs RISC for modern CPUs (last 20 or 30 years) doesn't mean anything. Any modern ARM or Intel is partially RISC and a CISC at the same time.

Think this way:

1) modern x86 micro-ops can be viewed as RISC. So x86 is RISC?

2) The ARM1 (from 1985) had micro-ops, which means its user-visible opcodes were not reduced enough. So ARM is CISC? http://www.righto.com/2016/02/reverse-engineering-arm1-proce... Not even talking about modern 64-bit ARM cores.

I don’t think implementation details are relevant for CISC vs RISC. At least not anymore. I think it’s a characteristic of the CPU’s instruction set. Not the internal undocumented instructions, but the instructions publicly available to programmers.

Also, ARM says their CPU designs are based on RISC principles: https://developer.arm.com/products/architecture/cpu-architec...

I guess that's why all high performance CPUs of the last twenty years have something like this.
