

New instructions in Intel's Haswell microarchitecture - ssp
http://software.intel.com/en-us/forums/showthread.php?t=83399

======
ssp
Overview of the new instructions (cut and pasted from the PDF):

AVX2 promotes the vast majority of 128-bit integer SIMD instruction sets to
operate with 256-bit wide YMM registers. AVX2 instructions are encoded using
the VEX prefix and require the same operating system support as AVX.
Generally, most of the promoted 256-bit vector integer instructions follow the
128-bit lane operation, similar to the promoted 256-bit floating-point SIMD
instructions in AVX. Newer functionalities in AVX2 generally fall into the
following categories:

• Fetching non-contiguous data elements from memory using vector-index memory
addressing. These “gather” instructions introduce a new memory-addressing
form, consisting of a base register and multiple indices specified by a vector
register (either XMM or YMM). Data element sizes of 32 and 64 bits are
supported, for both floating-point and integer data types.

• Cross-lane functionalities are provided with several new instructions for
broadcast and permute operations. Some of the 256-bit vector integer
instructions promoted from legacy SSE instruction sets also exhibit cross-lane
behavior, e.g. VPMOVZ/VPMOVS family.

• AVX2 complements the AVX instructions that are typed for floating-point
operation with a full complement of equivalent instructions for operating on
32/64-bit integer data elements.

• Vector shift instructions with per-element shift count. Data element sizes
of 32 and 64 bits are supported.
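
To make the gather and per-element shift bullets concrete, here is a rough scalar model in Python. It is only a sketch of the behaviour described above, not the instructions themselves: the function names are made up for illustration, memory is modelled as a plain list, and the per-element mask that the real gather instructions also take is left out.

```python
def gather(memory, base, indices, scale=1):
    """Model of a gather: one load per index, all relative to one base.

    `memory` is a plain list standing in for RAM, `indices` stands in
    for the index vector register. Masking is not modelled.
    """
    return [memory[base + i * scale] for i in indices]


def shift_left_variable(values, counts, width=32):
    """Model of a per-element left shift: each lane has its own count.

    Results are truncated to `width` bits, so a count of `width` or
    more leaves zero in that lane.
    """
    lane_mask = (1 << width) - 1
    return [(v << c) & lane_mask for v, c in zip(values, counts)]


memory = list(range(100, 116))            # 16 "memory" cells holding 100..115
loaded = gather(memory, 2, [0, 5, 1, 3])  # four non-contiguous loads
shifted = shift_left_variable([1, 1, 3, 0x80000000], [0, 4, 1, 1])
```

Each list element plays the role of one vector lane; in hardware the lanes would be loaded or shifted by a single instruction rather than by a loop.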

~~~
athom
I find the "gather" functionality especially intriguing.

APL and J have "transpose" instructions that let you rearrange the dimensions
of a multidimensional array. I figured out a while back how that could be done
without having to move all the individual elements in memory; basically, keep
track of the 'step size' between elements on each axis, and you can shuffle
them around as much as you like. Of course, once you've done that, you're
jumping back and forth all over the array to pull in each element when/where
you want it.

Well, looky what those "gather" instructions do! Promising. _Very_
promising...
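
A minimal Python sketch of that step-size idea, under my own assumptions: `StridedView` and its methods are hypothetical names, the array is a flat list stored row-major, and the "gather" is just a list of flat offsets that a vector-index load could consume.

```python
from itertools import product


class StridedView:
    """Read-only N-dimensional view over a flat list (hypothetical helper)."""

    def __init__(self, data, shape, strides, offset=0):
        self.data = data
        self.shape = tuple(shape)
        self.strides = tuple(strides)   # step size between elements on each axis
        self.offset = offset

    def transpose(self, *axes):
        # Reorder the axes by permuting shape and strides; no element moves.
        return StridedView(
            self.data,
            [self.shape[a] for a in axes],
            [self.strides[a] for a in axes],
            self.offset,
        )

    def flat_index(self, index):
        # Position in the underlying flat storage for an N-d index.
        return self.offset + sum(i * s for i, s in zip(index, self.strides))

    def __getitem__(self, index):
        return self.data[self.flat_index(index)]

    def gather_indices(self):
        # Flat offsets visited when walking the view in its own row-major
        # order: the index vector a gather-style load would need.
        return [
            self.flat_index(ix)
            for ix in product(*(range(n) for n in self.shape))
        ]


data = list(range(6))                          # a 2x3 array, stored row-major
a = StridedView(data, shape=(2, 3), strides=(3, 1))
t = a.transpose(1, 0)                          # 3x2 view of the same storage
```

Walking `t` in order touches the storage at offsets that jump back and forth, which is exactly the access pattern where a hardware gather might help.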

------
onan_barbarian
These are extremely interesting instructions. It's worth noting, however, that
things like gather/scatter live and die by the quality of their
implementation. I would hope that these are considered worth delivering high-
quality implementations, but let's just say that Intel has occasionally been
known to deliver complex instructions that are all but useless due to poor
latency and reciprocal throughput. SSE4.2, I'm looking at you.

I suspect that for these ones, it's going to be good - this looks too
fundamental to the sorts of changes first described in Larrabee programming
guides. The hope is that many of these functions (conditional loads,
scatter/gather, etc) will allow a much larger proportion of loops to be
parallelized than is currently the case. Obviously not everything is amenable
to parallelization due to serial dependencies, but having scatter/gather to do
multiple load/store stuff makes it a lot easier.

I think the extent to which this seems to be converging on some of the stuff
in Larrabee is interesting - is the game plan to converge the two types of
cores, so that the difference between CPU and GPU cores comes down to
provisioning and access to some GPU-specific canned functionality for
traditional T&L operations?

Overall it's like a wet dream for bit-bashers. PDEP? PEXT? Byte field extract?
Scatter? 256-bit integer SIMD? I was almost too excited to speak when I first
skimmed this (moderated somewhat by the 2013 release date and lingering
concerns that some of the better insns will have latency 8 and reciprocal
throughput 5 or something disappointing like that).

------
gorset
Wow, I can't wait to play around with PDEP and PEXT instructions.

~~~
seunosewa
What do they do that's so exciting?

~~~
gorset
I want to use them for speeding up some critical code for
compression/decompression; the current code is not straightforward.

How to implement the instructions in software is covered in Hacker's Delight,
with some additions in the latest revisions
(<http://www.hackersdelight.org/revisions.pdf>).

There's also a chapter in Matters Computational
(<http://www.jjj.de/fxt/#fxtbook>) "CPU instructions often missed" that
mentions them:

> Primitives for permutations of bits, see section 1.29.2 on page 81. A bit-
> gather and a bit-scatter instruction for sub-words of all sizes a power of 2
> would allow for arbitrary permutations (see [FXT: bits/bitgather.h] and
> [FXT: bits/bitseparate.h] for versions working on complete words).
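
For anyone wondering what the two instructions compute, here is a naive one-bit-at-a-time sketch in Python. It is not the method from either book, just the simplest loop I could write under the usual description: PEXT packs the bits selected by a mask into the low end of the result, and PDEP spreads low-order bits back out to the mask positions.

```python
def pext(value, mask):
    """Parallel bit extract: collect the bits of `value` at the set
    positions of `mask` and pack them contiguously from bit 0 upward."""
    result = 0
    out_bit = 0
    while mask:
        lowest = mask & -mask          # isolate the lowest set mask bit
        if value & lowest:
            result |= 1 << out_bit
        out_bit += 1
        mask &= mask - 1               # clear that mask bit
    return result


def pdep(value, mask):
    """Parallel bit deposit: take the low bits of `value` in order and
    place them at the set positions of `mask`; other bits are zero."""
    result = 0
    in_bit = 0
    while mask:
        lowest = mask & -mask
        if value & (1 << in_bit):
            result |= lowest
        in_bit += 1
        mask &= mask - 1
    return result


packed = pext(0x12345678, 0x00FF00FF)      # pulls out bytes 0x34 and 0x78
spread = pdep(0b1011, 0b11110000)          # moves a nibble up by four bits
```

The loop runs once per set mask bit, which is why a single-instruction version is attractive for bit-packing code such as compression.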

------
cageface
These days it's the ARM instruction set I care about. Strange how fast x86
dropped off my radar.

~~~
__rkaup__
What about servers?

~~~
cageface
Downmodders are really out of control here lately. Obviously I'm not saying
that _nobody_ cares about x86 anymore. It just very quickly went from
years of being the only game in town to a sideshow _for me_ when I started
doing mobile.

What server software I do write is in a very high-level language like Ruby,
but I actually find myself looking at ARM disassembly sometimes now for apps.

~~~
cookiecaper
Maybe the feeling is that your personal statement of discontent with x86 is
not really relevant to the story that new instructions have been added to x86.
This is probably why you were downvoted. If you want to leave a comment with a
similar essence, perhaps you should try commenting on one or two new
instructions that are [interesting|upsetting|repetitive] to you ahead of your
statement that you don't care about x86 anymore.

Would HN be a good place if everyone who didn't care about something posted on
the story and said, "Ha, I don't care about this, ha ha"?

~~~
cageface
When I started writing software professionally over ten years ago, x86 was the
only instruction set that mattered outside of a few niche domains. Now it's
all but irrelevant in the fastest growing market. For those of us that have
lived in the shadow of Wintel our entire professional lives the last few years
have seen some dizzying changes.

So the big picture for me is that developments in x86 are now minor news.

~~~
jrockway
PowerPC used to be pretty popular. So was SPARC.

