AVX2 promotes the vast majority of the 128-bit integer SIMD instructions to operate on 256-bit wide YMM registers. AVX2 instructions are encoded using the VEX prefix
and require the same operating system support as AVX. Generally, most of the
promoted 256-bit vector integer instructions operate within 128-bit lanes,
like the promoted 256-bit floating-point SIMD instructions in AVX.
Newer functionalities in AVX2 generally fall into the following categories:
• Fetching non-contiguous data elements from memory using vector-index
memory addressing. These “gather” instructions introduce a new memory-addressing form, consisting of a base register and multiple indices specified by a
vector register (either XMM or YMM). Data element sizes of 32 and 64 bits are
supported, as are both floating-point and integer element types.
• Cross-lane functionality is provided by several new instructions for
broadcast and permute operations. Some of the 256-bit vector integer instruc-
tions promoted from legacy SSE instruction sets also exhibit cross-lane behavior,
e.g. the VPMOVZX/VPMOVSX family.
• AVX2 complements the AVX instructions that are typed for floating-point
operation with a full complement of equivalents for operating on 32/64-bit
integer data elements.
• Vector shift instructions with a per-element shift count. Data element sizes of 32
and 64 bits are supported.
APL and J have "transpose" instructions that let you rearrange the dimensions of a multidimensional array. I figured out a while back how that could be done without having to move all the individual elements in memory; basically, keep track of the 'step size' between elements along each axis, and you can shuffle the axes around as much as you like. Of course, once you've done that, you're jumping back and forth all over the array to pull in each element when and where you want it.
Well, looky what those "gather" instructions do! Promising. Very promising...
I suspect that for these ones, it's going to be good - this looks fundamentally similar to the sorts of changes first described in the Larrabee programming guides. The hope is that many of these features (conditional loads, scatter/gather, etc.) will allow a much larger proportion of loops to be parallelized than is currently the case. Obviously not everything is amenable to parallelization due to serial dependencies, but having scatter/gather to do multiple loads/stores makes it a lot easier.
I think it's interesting how much this seems to be converging on some of the stuff in Larrabee. Is the game plan to converge these types of cores with GPU functionality, so that the difference between CPU and GPU cores comes down to provisioning and access to some GPU-specific canned functionality for traditional T&L operations?
Overall it's like a wet dream for bit-bashers. PDEP? PEXT? Byte field extract? Scatter? 256-bit integer SIMD? I was almost too excited to speak when I first skimmed this (moderated somewhat by the 2013 release date and lingering concerns that some of the better insns will have latency 8 and reciprocal throughput 5 or something disappointing like that).
How to implement the instructions in software is covered in Hacker's Delight, with some additions in the latest revisions (http://www.hackersdelight.org/revisions.pdf).
There's also a chapter in Matters Computational (http://www.jjj.de/fxt/#fxtbook) "CPU instructions often missed" that mentions them:
> Primitives for permutations of bits, see section 1.29.2 on page 81. A bit-gather and a bit-scatter instruction for sub-words of all sizes a power of 2 would allow for arbitrary permutations (see [FXT: bits/bitgather.h] and [FXT: bits/bitseparate.h] for versions working on complete words).
On page 488, PDEP and PEXT are Parallel Bits Deposit and Parallel Bits Extract. They are essentially scatter/gather instructions for bits.
PDEP uses a mask in the second source operand (the third operand) to transfer/scatter contiguous low-order bits in the first source operand (the second operand) into the destination (the first operand).
PEXT uses a mask in the second source operand (the third operand) to transfer either contiguous or non-contiguous bits in the first source operand (the second operand) to contiguous low order bit positions in the destination (the first operand).
What server software I do write is in a very high-level language like Ruby, but I actually find myself looking at ARM disassembly sometimes now for apps.
Would HN be a good place if everyone who didn't care about something posted on the story and said, "Ha, I don't care about this, ha ha"?
So the big picture for me is that developments in x86 are now minor news.