New instructions in Intel's Haswell microarchitecture (intel.com)
37 points by ssp on June 11, 2011 | 14 comments

Overview of the new instructions (cut and pasted from the PDF):

AVX2 promotes the vast majority of 128-bit integer SIMD instruction sets to operate with 256-bit wide YMM registers. AVX2 instructions are encoded using the VEX prefix and require the same operating system support as AVX. Generally, most of the promoted 256-bit vector integer instructions follow the 128-bit lane operation, similar to the promoted 256-bit floating-point SIMD instructions in AVX. Newer functionalities in AVX2 generally fall into the following categories:

• Fetching non-contiguous data elements from memory using vector-index memory addressing. These “gather” instructions introduce a new memory-addressing form, consisting of a base register and multiple indices specified by a vector register (either XMM or YMM). Data elements sizes of 32 and 64-bits are supported, and data types for floating-point and integer elements are also supported.

• Cross-lane functionalities are provided with several new instructions for broadcast and permute operations. Some of the 256-bit vector integer instructions promoted from legacy SSE instruction sets also exhibit cross-lane behavior, e.g. VPMOVZ/VPMOVS family.

• AVX2 complements the AVX instructions that are typed for floating-point operation with a full complement of equivalent set for operating with 32/64-bit integer data elements.

• Vector shift instructions with per-element shift count. Data elements sizes of 32 and 64-bits are supported.
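To make two of the categories above concrete, here is a rough scalar model of what a gather and a per-element shift compute. This is only a sketch, not Intel's definition: the function names are made up for illustration, `memory` is a plain list of elements rather than bytes, and the shift assumes a logical left shift in which counts of `width` or more yield zero.

```python
def gather(memory, base, indices, scale=1):
    """Scalar model of a vector gather: one independent load per index lane.

    Hypothetical helper: `memory` is a list of elements, not a byte array.
    """
    return [memory[base + i * scale] for i in indices]


def shift_left_per_element(values, counts, width=32):
    """Shift each lane left by its own count, truncating to `width` bits.

    Assumption: counts >= width produce 0 (logical shift behaviour).
    """
    lane_mask = (1 << width) - 1
    return [
        (v << c) & lane_mask if c < width else 0
        for v, c in zip(values, counts)
    ]


table = [10, 11, 12, 13, 14, 15, 16, 17]
print(gather(table, 0, [7, 0, 3, 3]))           # [17, 10, 13, 13]
print(gather(table, 1, [0, 1, 2], scale=2))     # [11, 13, 15]
print(shift_left_per_element([1, 1, 1, 1], [0, 4, 31, 32]))
```

The point is that every lane gets its own address or its own shift count, where the older SIMD forms applied one address or one count to the whole register.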

I find the "gather" functionality especially intriguing.

APL and J have "transpose" instructions that let you rearrange the dimensions of a multidimensional array. I figured out a while back how that could be done without having to move all the individual elements in memory; basically, keep track of the 'step size' between elements on each axis, and you can shuffle them around as much as you like. Of course, once you've done that, you're jumping back and forth all over the array to pull in each element when/where you want it.

Well, looky what those "gather" instructions do! Promising. Very promising...
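For anyone who hasn't seen the step-size trick, a minimal sketch follows. The `StridedView` class and its methods are invented for illustration and are not how APL or J actually implement it. Transposing only permutes the shape and stride tuples; the flat offsets that fall out afterwards are exactly the sort of index vector a gather would consume.

```python
from itertools import product


class StridedView:
    """Hypothetical N-d view over a flat list, described by shape and strides."""

    def __init__(self, data, shape, strides=None):
        self.data = data
        self.shape = tuple(shape)
        if strides is None:
            # Row-major default: the last axis is contiguous.
            strides, step = [], 1
            for n in reversed(self.shape):
                strides.append(step)
                step *= n
            strides.reverse()
        self.strides = tuple(strides)

    def transpose(self, *axes):
        # No element moves: only the per-axis bookkeeping is permuted.
        return StridedView(
            self.data,
            [self.shape[a] for a in axes],
            [self.strides[a] for a in axes],
        )

    def offsets(self):
        # Flat offsets in iteration order of this view.
        return [
            sum(i * s for i, s in zip(idx, self.strides))
            for idx in product(*(range(n) for n in self.shape))
        ]

    def to_list(self):
        return [self.data[o] for o in self.offsets()]


data = [10, 11, 12, 13, 14, 15]
a = StridedView(data, (2, 3))   # strides (3, 1)
t = a.transpose(1, 0)           # shape (3, 2), strides (1, 3)
print(t.offsets())              # [0, 3, 1, 4, 2, 5]
print(t.to_list())              # [10, 13, 11, 14, 12, 15]
```

The transposed view shares the original list, so the cost moves from the transpose itself to the scattered reads in `to_list`.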

These are extremely interesting instructions. It's worth noting, however, that things like gather/scatter live and die by the quality of their implementation. I would hope that these are considered worth delivering high-quality implementations, but let's just say that Intel has occasionally been known to deliver complex instructions that are all but useless due to poor latency and reciprocal throughput. SSE4.2, I'm looking at you.

I suspect that for these ones, it's going to be good - this looks too fundamental to the sorts of changes first described in Larrabee programming guides. The hope is that many of these functions (conditional loads, scatter/gather, etc) will allow a much larger proportion of loops to be parallelized than is currently the case. Obviously not everything is amenable to parallelization due to serial dependencies, but having scatter/gather to do multiple load/store stuff makes it a lot easier.

I think it is interesting how far this seems to be converging on some of the stuff in Larrabee. Is the game plan to converge the core types with GPU functionality, so that the difference between CPU and GPU cores is provisioning and access to some GPU-specific canned functionality for traditional T&L operations?

Overall it's like a wet dream for bit-bashers. PDEP? PEXT? Byte field extract? Scatter? 256-bit integer SIMD? I was almost too excited to speak when I first skimmed this (moderated somewhat by the 2013 release date and lingering concerns that some of the better insns will have latency 8 and reciprocal throughput 5 or something disappointing like that).

Wow, I can't wait to play around with PDEP and PEXT instructions.

What do they do that's so exciting?

I want to use them to speed up some critical compression/decompression code; the current code is not straightforward.

How to implement the instructions in software is covered in Hacker's Delight, and has some additions in the latest revisions (http://www.hackersdelight.org/revisions.pdf).

There's also a chapter in Matters Computational (http://www.jjj.de/fxt/#fxtbook) "CPU instructions often missed" that mentions them:

> Primitives for permutations of bits, see section 1.29.2 on page 81. A bit-gather and a bit-scatter instruction for sub-words of all sizes a power of 2 would allow for arbitrary permutations (see [FXT: bits/bitgather.h] and [FXT: bits/bitseparate.h] for versions working on complete words).

The document is pretty easy to navigate if you know which instructions you want to look up.

On page 488, PDEP and PEXT are Parallel Bits Deposit and Parallel Bits Extract. They are essentially scatter/gather instructions for bits.

PDEP uses a mask in the second source operand (the third operand) to transfer/scatter contiguous low order bits in the first source operand (the second operand) into the destination (the first operand).

PEXT uses a mask in the second source operand (the third operand) to transfer either contiguous or non-contiguous bits in the first source operand (the second operand) to contiguous low order bit positions in the destination (the first operand).
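A bit-at-a-time software model of the two descriptions above, in case it helps. This is a simple reference sketch, not the faster method from Hacker's Delight and not Intel's pseudocode. It assumes non-negative Python integers, so the operand width is implied by the mask rather than fixed at 32 or 64 bits.

```python
def pext(src, mask):
    """Pack the bits of `src` selected by `mask` into the low-order bits."""
    result = 0
    out_bit = 0
    while mask:
        lowest = mask & -mask        # isolate the lowest set bit of the mask
        if src & lowest:
            result |= 1 << out_bit
        out_bit += 1
        mask &= mask - 1             # clear that bit and move on
    return result


def pdep(src, mask):
    """Spread the low-order bits of `src` out to the positions set in `mask`."""
    result = 0
    in_bit = 0
    while mask:
        lowest = mask & -mask
        if (src >> in_bit) & 1:
            result |= lowest
        in_bit += 1
        mask &= mask - 1
    return result


print(bin(pext(0b10110100, 0b11110000)))  # 0b1011
print(bin(pdep(0b1011, 0b11110000)))      # 0b10110000
print(bin(pext(0b10110100, 0b01010101)))  # 0b110
print(bin(pdep(0b0110, 0b01010101)))      # 0b10100
```

With the same mask, `pdep(pext(x, m), m)` gives back `x & m`, which is a handy sanity check when experimenting.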

You can finally implement INTERCAL's squiggle operator efficiently!

These days it's the ARM instruction set I care about. Strange how fast x86 dropped off my radar.

What about servers?

Downmodders are really out of control here lately. Obviously I'm not saying that nobody cares about x86 anymore. It just very quickly went from years of being the only game in town to a sideshow for me when I started doing mobile.

What server software I do write is in very HLL like Ruby but I actually find myself looking at ARM disassembly sometimes now for apps.

Maybe the feeling is that your personal statement of discontent with x86 is not really relevant to the story that new instructions have been added to x86. This is probably why you were downvoted. If you want to leave a comment with a similar essence, perhaps you should try commenting on one or two new instructions that are [interesting|upsetting|repetitive] to you ahead of your statement that you don't care about x86 anymore.

Would HN be a good place if everyone who didn't care about something posted on the story and said, "Ha, I don't care about this, ha ha"?

When I started writing software professionally over ten years ago, x86 was the only instruction set that mattered outside of a few niche domains. Now it's all but irrelevant in the fastest growing market. For those of us who have lived in the shadow of Wintel our entire professional lives, the last few years have seen some dizzying changes.

So the big picture for me is that developments in x86 are now minor news.

PowerPC used to be pretty popular. So was SPARC.
