Discussed here a couple days ago?
Having x86 or ARM CPU cores share the same die and memory space as streaming processor cores is a straight path toward a "perception engine" that could accelerate machine learning, text analysis, machine vision, and similar tasks by 100x.
I'm hoping for high-end server parts that will help me build generative and discriminative models, but the current "low-end" fusion parts for smartphones, tablets, and cheap laptops use a fifth the power of a conventional architecture while running GPU-suitable tasks -- video encoding and decoding, for instance.
And if it all comes together, you'll be able to run the models I build on a big server to deliver intelligent capabilities to mobile and desktop apps.
Anyway, Intel's in better shape than you think and in a significantly better position than AMD, especially after recent acquisitions. The big reason is that for anything above low-end, you don't want to share a die.
Hypothetical architecture: put a future Xeon Phi chip (descendant of Larrabee, formerly Knights Corner, etc.) on QPI. Have its on-die memory controller be based on GDDR5 (or a GDDR5 successor: low capacity, high latency, high bandwidth). Put a standard Xeon next to it, with DDR4 or the DDR4 successor on its memory controller (high capacity, low latency, relatively low bandwidth). Now maybe add an InfiniBand chip from the QLogic acquisition, maybe a fast path to an Intel SSD as well, and voilà: an all-Intel HPC node with a shared CPU/vector-processor address space, and you don't even need PCIe.
The idea of a combined CPU/GPU for servers with high performance on either the CPU or the GPU side is a pipe dream for a few reasons:
1. The big driver of GPU performance isn't FLOPs, it's bandwidth. Most of the applications out there on GPUs today are bandwidth limited, not FLOP limited. In other words, the max performance gain you're looking at from a GPU port is on the order of the bandwidth boost, which has been ~2.5x per socket since Nehalem or so.
2. The reason GPUs can get so much bandwidth is because they throw everything else under the bus in the quest for bandwidth--GPU memory latency is an order of magnitude or two higher than CPU memory latency, capacity is painfully limited, everything's soldered down, etc. (The reason why they do this is that sufficiently data-parallel applications can get away with high latency and that GPUs can therefore be big latency-hiding machines.)
3. If you try to use memory with the wrong characteristics for a given processor, you're basically going to cripple that processor. A GPU with 128GB of memory would be cool, but it would provide no benefit for most apps, even with a very fast interconnect between the CPU and GPU. A CPU with 12 or 24GB of GDDR5 would perform terribly, since a CPU can't hide memory latency the way a GPU does, and it would be a complete joke in the marketplace. Putting both kinds of memory controller on one die doesn't really work either, due to fab constraints.
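Point 1 above can be sketched with a back-of-the-envelope roofline estimate. All numbers here are illustrative placeholders, not figures from the post: attainable throughput is capped by the smaller of peak compute and (arithmetic intensity × memory bandwidth), so for low-intensity kernels, bandwidth is the only knob that matters.

```java
// Roofline sketch: attainable GFLOP/s = min(peak compute, intensity * bandwidth).
// The numbers are made up for illustration.
public class Roofline {
    public static void main(String[] args) {
        double peakGflops = 1000.0;  // hypothetical GPU peak compute, GFLOP/s
        double bandwidthGBs = 150.0; // hypothetical memory bandwidth, GB/s
        double intensity = 0.25;     // FLOPs per byte, typical of a SAXPY-like kernel

        // For a bandwidth-bound kernel the bandwidth term wins:
        double attainable = Math.min(peakGflops, intensity * bandwidthGBs);
        System.out.printf("attainable: %.1f GFLOP/s (peak: %.1f)%n",
                          attainable, peakGflops);
        // Only 37.5 of the 1000 GFLOP/s peak is reachable here: doubling
        // bandwidth doubles throughput, doubling peak FLOPs does nothing.
    }
}
```

This is why a GPU port's speedup tends to track the bandwidth gap rather than the FLOP gap.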
So really, for anything where you're intending to use the GPU/data parallel side as an integrated accelerator rather than an endpoint (that displays graphics to the screen), you want two dies. Intel's in very, very good shape there. (In mobile/low end, you can get away with slower memory for the GPU because you can rely on shared L2/L3 cache to make up for a lot of the perf loss. That is significantly less acceptable for big GPUs dealing with much bigger datasets.)
If I do a callableThing.map(iterableThing), I may not always care whether the callable does its magic sequentially on the elements of the iterable. And unless I explicitly use the resulting iterable, I don't care about the order in which the results are returned.
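In Java terms, that contract already has a spelling: a parallel, unordered stream. This is just a sketch of the idea (the squaring function and the collector choice are illustrative), showing a map where the runtime is free to pick both execution order and result order -- exactly the freedom a GPU-backed map would need.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class UnorderedMap {
    public static void main(String[] args) {
        List<Integer> input = List.of(1, 2, 3, 4, 5);

        // parallel() + unordered() tell the runtime we don't care about
        // sequencing; collecting into a Set discards result order entirely.
        Set<Integer> squares = input.stream()
                .parallel()
                .unordered()
                .map(x -> x * x)
                .collect(Collectors.toSet());

        System.out.println(squares); // 1, 4, 9, 16, 25 in some order
    }
}
```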
It would be fantastic, for multiple reasons, for this to be part of the JVM. It would make packaging code that uses the GPU simpler for cross-system releases.
It could also bring more desktop/game projects to Java.