These are extremely interesting instructions. It's worth noting, however, that things like gather/scatter live and die by the quality of their implementation. I would hope Intel considers these worth implementing well, but let's just say that Intel has occasionally been known to deliver complex instructions that are all but useless due to poor latency and reciprocal throughput. SSE4.2, I'm looking at you.

I suspect that for these, it's going to be good: they look too fundamental to the sorts of changes first described in the Larrabee programming guides. The hope is that many of these features (conditional loads, scatter/gather, etc.) will allow a much larger proportion of loops to be parallelized than is currently the case. Obviously not everything is amenable to parallelization due to serial dependencies, but having scatter/gather handle several independent loads or stores to non-contiguous addresses in one instruction makes it a lot easier.
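To make the loop shape concrete, here's a rough scalar model of what a masked gather and a scatter do, lane by lane. This is only a sketch of the semantics in plain C, not the real instructions or intrinsics; the names `gather_masked_ref` and `scatter_ref` are made up for illustration, and the byte-per-lane mask is an assumption of mine.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an Intel API): scalar model of a masked gather.
   For each lane i, load table[idx[i]] when mask[i] is nonzero;
   otherwise leave dst[i] unchanged. */
void gather_masked_ref(int32_t *dst, const int32_t *table,
                       const int32_t *idx, const uint8_t *mask, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (mask[i]) {
            dst[i] = table[idx[i]];   /* indexed (non-contiguous) load */
        }
    }
}

/* Hypothetical helper: scalar model of a scatter.
   Lanes are written in order, so with duplicate indices the last lane wins.
   That ordering is a property of this loop, not a claim about hardware. */
void scatter_ref(int32_t *table, const int32_t *idx,
                 const int32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        table[idx[i]] = src[i];       /* indexed (non-contiguous) store */
    }
}
```

Loops built around `table[idx[i]]` like these are exactly the ones that are awkward to vectorize without gather/scatter, since each lane's address has to be computed and accessed separately.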

It's interesting how far this seems to be converging on some of the stuff in Larrabee. Is the game plan to converge the CPU and GPU core types, so that the only difference between CPU and GPU cores is provisioning and access to some GPU-specific canned functionality for traditional T&L operations?

Overall it's like a wet dream for bit-bashers. PDEP? PEXT? Byte field extract? Scatter? 256-bit integer SIMD? I was almost too excited to speak when I first skimmed this (moderated somewhat by the 2013 release date and lingering concerns that some of the better insns will have latency 8 and reciprocal throughput 5 or something disappointing like that).
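For anyone who hasn't met PEXT and PDEP yet, here's a portable scalar sketch of what they compute: PEXT pulls the source bits selected by a mask and packs them into the low end of the result, and PDEP does the reverse, spreading the low source bits out to the mask positions. The names `pext_ref` and `pdep_ref` are my own; this is a reference model in plain C, not the intrinsics, and it says nothing about how fast the real thing will be.

```c
#include <stdint.h>

/* Hypothetical reference model of PEXT: gather the bits of src selected
   by mask and pack them contiguously into the low bits of the result. */
uint64_t pext_ref(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    uint64_t out_bit = 1;                       /* next packed output position */
    while (mask) {
        uint64_t lowest = mask & (~mask + 1);   /* isolate lowest set mask bit */
        if (src & lowest) {
            result |= out_bit;
        }
        out_bit <<= 1;
        mask &= mask - 1;                       /* clear that mask bit */
    }
    return result;
}

/* Hypothetical reference model of PDEP: take the low bits of src and
   deposit them, in order, at the positions of the set bits in mask. */
uint64_t pdep_ref(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    uint64_t in_bit = 1;                        /* next source bit to consume */
    while (mask) {
        uint64_t lowest = mask & (~mask + 1);   /* isolate lowest set mask bit */
        if (src & in_bit) {
            result |= lowest;
        }
        in_bit <<= 1;
        mask &= mask - 1;                       /* clear that mask bit */
    }
    return result;
}
```

For example, `pext_ref(0xABCD, 0x0F0F)` picks out the low nibble of each byte and gives `0xBD`, and `pdep_ref(0xBD, 0x0F0F)` puts them back as `0x0B0D`. Doing that with shifts and masks by hand takes a loop like the above or a pile of mask-specific code, which is why single-instruction versions are so appealing.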