
Introduction to GEN Assembly in OpenCL - ingve
https://software.intel.com/en-us/articles/introduction-to-gen-assembly
======
mattst88
Another Intel employee here (don't know the author). I work on Intel's Open
Source OpenGL driver, which is part of Mesa. Most of the work I do is on our
GLSL compiler and Gen hardware backend.

I've never used VTune (but it looks very cool) and some of the notation is
different from what I'm used to.

If you're using an Intel GPU made in the last 10 years on Linux, the
environment variable INTEL_DEBUG will allow you to see the disassembled
shaders for a given program. Try "INTEL_DEBUG=fs,vs glxgears" to see the
fragment and vertex shaders' assembly and the dumps of some intermediate
representations along the way.

The Gen (or i965, as we call it) instruction set is really powerful, but does
take a while to understand fully.

I've actually been trying to finish up an article about some tricks we do in
the i965_dri.so driver (think bit twiddling hacks, but each using some
interesting features of the instruction set). I'm curious if there's interest
in such a thing. I'd probably be more motivated to finish it. :)

~~~
Narishma
What's the reason for generating 2 versions of each fragment shader (SIMD8 and
SIMD16)?

~~~
mattst88
That's a good question.

SIMD8/SIMD16 refers to the number of fragments processed per thread
invocation. There is some overhead to spawning a thread, and so processing 16
fragments at a time is typically faster even though the shader itself is doing
more work.

The driver provides both versions because it's the GPU that decides which
version is used and where, even using both versions to shade the same
primitive. For instance, on a triangle boundary maybe only 4 or 8 fragments
are "lit", so the hardware spawns a SIMD8 thread and saves itself a little
work.

SIMD16 shaders typically use twice the number of registers as SIMD8, and if
they require registers to be spilled to memory it's likely the SIMD8 shader is
faster even with the additional thread-spawning overhead. Lots of compiler
optimization revolves around trying to squeak in under the register limit to
get the program compiled as SIMD16 without spilling. :)
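To make the tradeoff concrete, here's a toy cost model (all of the numbers and the `shading_cost` helper are hypothetical, not anything from the actual driver): each thread pays a fixed spawn overhead plus a per-lane cost, so wide threads amortize the overhead on fully covered blocks, but spilling or sparse coverage can flip the winner to SIMD8.

```python
import math

def shading_cost(num_fragments, simd_width, spawn_overhead,
                 cost_per_fragment, spill_penalty=0.0):
    """Hypothetical cost model: each thread pays a fixed spawn overhead
    plus a per-lane cost for every lane, even lanes beyond the live
    fragments on a partially covered primitive."""
    threads = math.ceil(num_fragments / simd_width)
    per_thread = spawn_overhead + simd_width * (cost_per_fragment + spill_penalty)
    return threads * per_thread

# Fully covered block of 64 fragments: SIMD16 amortizes spawn overhead better.
full16 = shading_cost(64, 16, spawn_overhead=10, cost_per_fragment=1)
full8  = shading_cost(64,  8, spawn_overhead=10, cost_per_fragment=1)
assert full16 < full8

# Same workload, but the SIMD16 shader spills registers: now SIMD8 wins
# despite spawning twice as many threads.
spill16 = shading_cost(64, 16, spawn_overhead=10, cost_per_fragment=1,
                       spill_penalty=1.0)
assert full8 < spill16

# Triangle edge with only 6 live fragments: one SIMD8 thread is cheaper
# than one SIMD16 thread, since the unused lanes still cost something.
edge8  = shading_cost(6,  8, spawn_overhead=10, cost_per_fragment=1)
edge16 = shading_cost(6, 16, spawn_overhead=10, cost_per_fragment=1)
assert edge8 < edge16
```

Obviously the real hardware heuristics are far more involved, but this captures why the driver wants both compiled versions on hand.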

------
iheartmemcache
Just a side-note (I'm not an Intel employee, but I haven't seen much Intel
awareness re: their tooling here [they're fairly bad at reaching out to this
community, but they have tons of free tools available that'll save your team
money if you're doing anything that's not IO bound]):

* [http://ispc.github.io/](http://ispc.github.io/) -> An Intel front-end compiler for "SPMD" that targets LLVM, which has proven to be useful.

* PIN -> Dynamic analysis, free. Think DTrace and Valgrind on steroids. (Not open source, I'll take what I can get)

* ICC -> Their compiler suite gets little love on HN (though there aren't many engineers who really write computationally intensive stuff here, or if they do I suppose they just throw Amazon instances at it), but it's so cheap for what you get. Such an extensive tooling set out of the box at around the pricing of VS (both of which are well under 5k for 1 seat re: their highest version).

My team has done a few jobs where we've jumped in using Intel tooling and both
shifted and tightened the 1st-3rd quartile computational time by literally
3-5x and 10-20x respectively, basically just by using the Intel tooling and
being fairly familiar with it. Shameless plug & all ;)

