
Are the different MMX, SSE and AVX versions complementary or supersets of each other? - snoukkis
I'm thinking I should familiarize myself with x86 SIMD extensions. But before I even began I ran into trouble. I can't find a good overview on which of them are still relevant.

The x86 architecture has accumulated a lot of math/multimedia extensions over decades:

MMX
3DNow!
SSE
SSE2
SSE3
SSSE3
SSE4
AVX
AVX2
AVX512

Did I forget something?

Are the newer ones supersets of the older ones and vice versa? Or are they complementary?

Are some of them deprecated? Which of these are still relevant? I've heard references to "legacy SSE".

Are some of them mutually exclusive? I.e. do they share the same hardware parts?

Which should I use together to maximize hardware utilization on modern Intel / AMD CPUs? For sake of argument, let's assume I can find appropriate uses for the instructions... heating my house with the CPU if nothing else.
======
bit2mask
AVX rolls up all the previous SSE versions, and provides 3-operand versions of
those instructions. It also adds 256b versions of most FP insns (AVX) and
most integer insns (AVX2).

We don't really think of that as making SSE obsolete. It's more that AVX is
a new and better version of the same old SSE instructions. They're still in
the ref manual under their non-AVX names (PSHUFB, not VPSHUFB, for example).
You can mix AVX and SSE code, as long as you use VZEROUPPER when needed to
avoid the performance penalty from mixing VEX with non-VEX insns (on Intel).
So there is some annoyance in dealing with cases where you have to call into
libraries that might run non-VEX SSE instructions, or where your code uses SSE
FP math but also has some AVX code to be run only if the CPU supports it.
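
The "AVX code to be run only if the CPU supports it" case can be sketched
roughly as below. This is only an illustration, assuming GCC or Clang on
x86-64; the function names (add_f32, add_f32_avx, add_f32_baseline) are
made up for the example. The target attribute lets a single function use
VEX-encoded instructions while the rest of the file is built without -mavx,
and _mm256_zeroupper() is the intrinsic that emits VZEROUPPER.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

/* Baseline path: plain C, compiled with whatever the build flags allow
   (on x86-64 without -mavx that means scalar or legacy-SSE code). */
static void add_f32_baseline(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

#if HAVE_X86_DISPATCH
/* AVX path: 8 floats per iteration in a 256-bit register. */
__attribute__((target("avx")))
static void add_f32_avx(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)      /* leftover elements */
        dst[i] = a[i] + b[i];
    _mm256_zeroupper();     /* VZEROUPPER before returning to non-VEX code */
}
#endif

/* Hypothetical entry point: takes the AVX path only if the CPU reports AVX. */
void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst && a && b);
#if HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx")) {
        add_f32_avx(dst, a, b, n);
        return;
    }
#endif
    add_f32_baseline(dst, a, b, n);
}
```

Callers just use add_f32(out, a, b, n) and never see which path ran.
Compilers will often insert VZEROUPPER at the end of such a function on
their own, so the explicit call is mostly there to make the point visible.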

If CPU compatibility were a non-issue, the legacy-SSE versions of vector
instructions would be truly obsolete, like MMX is now. AVX/AVX2 is at least
slightly better in every way, if you count the VEX-encoded 128b version of an
insn as AVX, not SSE. Sometimes you'd still use 128b registers because your
data only comes in chunks that big, but more often you'd work with 256b
registers to do the same op on twice as much data at once.
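
To make the 128b vs. 256b point concrete, here is a small sketch with
integer adds, assuming GCC or Clang on x86-64. The names add4_i32 and
add8_i32 are invented for the example. Note that the 128-bit intrinsic is
the same either way: whether the compiler emits legacy-SSE PADDD or
VEX-encoded VPADDD for it depends on whether AVX is enabled for that
function.

```c
#include <assert.h>
#include <stdint.h>
#include <immintrin.h>

/* 128-bit: one packed add handles 4 x int32 (SSE2, baseline on x86-64). */
void add4_i32(int32_t *dst, const int32_t *a, const int32_t *b)
{
    assert(dst && a && b);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi32(va, vb));
}

/* 256-bit: the same operation on 8 x int32 at once (needs AVX2).
   Only call this after checking that the CPU supports AVX2. */
__attribute__((target("avx2")))
void add8_i32(int32_t *dst, const int32_t *a, const int32_t *b)
{
    assert(dst && a && b);
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)dst, _mm256_add_epi32(va, vb));
}
```

A caller would guard the wide version with something like
__builtin_cpu_supports("avx2") and fall back to two add4_i32 calls
otherwise.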

SSE/AVX/x87-FP/integer instructions all use the same execution ports. You
can't get more done in parallel by mixing them. (except on Haswell, where one
of the 4 ALU ports can only handle non-vector insns, like GP reg ops and
branches).

------
firimari
Having spent some time optimizing signal processing applications for Sandy
Bridge and Haswell targets, I'd say each flavor of SSE largely adds
instructions on top of the previous generation, and AVX adds wider versions
of those instructions (more elements per vector).

I haven't totally kept up with the AVX2 / AVX-512 additions, but from what
I've read, AVX-512 is only available on certain enterprise versions of the
newest generation of Intel chips.
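
Since support varies by chip, one way to see what a given machine actually
has is to ask at runtime. A rough sketch, assuming GCC or Clang on x86
(report_simd_support is a made-up name; __builtin_cpu_supports is the
compiler builtin and only accepts string literals, hence the macro):

```c
#include <assert.h>
#include <stdio.h>

/* Prints one line per extension and returns how many are supported.
   On non-x86 targets it prints nothing and returns 0. */
int report_simd_support(FILE *out)
{
    assert(out);
    int count = 0;
#if defined(__x86_64__) || defined(__i386__)
#define REPORT(name)                                          \
    do {                                                      \
        int ok = __builtin_cpu_supports(name) != 0;           \
        fprintf(out, "%-8s %s\n", name, ok ? "yes" : "no");   \
        count += ok;                                          \
    } while (0)
    REPORT("mmx");
    REPORT("sse");
    REPORT("sse2");
    REPORT("sse3");
    REPORT("ssse3");
    REPORT("sse4.1");
    REPORT("sse4.2");
    REPORT("avx");
    REPORT("avx2");
    REPORT("avx512f");
#undef REPORT
#endif
    return count;
}
```

Calling report_simd_support(stdout) prints a yes/no table for the ten
extensions checked here. "avx512f" is only the foundation subset of
AVX-512; the other subsets have their own feature names.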

If you're looking for applications to use your new skills, GNU Radio,
ImageMagick, and FFmpeg might have areas to dive into and experiment.

