Moore's law isn't about performance, it's the number of transistors on a single ...

varelse · on July 24, 2015

GTX 580: 585M transistors

GTX 680: 3.5B transistors

GTX Titan: 7.1B transistors

GTX 980: 5.2B transistors

GTX Titan X: 8B transistors

Core i7-5930k: 2.6B transistors

What the data above suggests to me is that relying solely on Moore's Law to predict performance is a fool's errand. Going forward, process transitions are obviously slowing down and IMO victory will go to those who make the best use of the available transistors. Just like programmers who make the best use of the caches and registers in these processors get dramatically better performance than those who can't be bothered to even think about such things.

Intel's business strategy of backwards-compatibility is a giant albatross for them here in that they spend a lot of transistors on this, but clearly otherwise profitable. In contrast, while GPUs are mostly backwards-compatible, they usually oops I meant nearly always oops I meant always need some refactoring to hit close to peak performance. But that usually leads to ~2x performance improvements per generation so far.

Whenever someone complains about having to do this I ask them if they prefer this over hand-coded assembler inner loops for maximally exploiting SSE/SSE2/SSE3/SSE4/AVX2/AVX512? Usually, I get some dismissive remark about leaving that to the compiler. Good luck with that plan IMO.

CHY872 · on July 25, 2015

Just to nitpick, backwards compatibility isn't really a huge issue for Intel. Most of the really old stuff that's a pain to maintain can be shoved in microcode; compilers won't emit those instructions.

There are obvious downsides to the architecture, but the need to be backwards compatibility shouldn't hurt it too much.

GPU workloads are very different in that generally you don't have to look particularly hard to find a bunch of parallelism that you can exploit (if you did, your code would run terribly); so you can generally gain a load of performance by just scaling up your design.

CPUs are super restricted by the single threaded, branching nature of the code you run on them, and this is what makes CPU performance a little more nuanced, and not directly comparable.

arcticbull · on July 25, 2015

That's not really true; backwards compatibility on x86 architectures takes a tremendous amount of power and die space, and the 'throw it in microcode' solution only partially mitigates this issue.

A paper (http://www.ic.unicamp.br/~ra045840/cardoso2013wivosca.pdf) states that a mostly-microcode solution would still require 20% of the die area to be dedicated solely to microcode ROM.

I can't remember where I read it but something like 30+% of an Intel CPU die area/power consumption is due to the x86 ISA. Apparently the original Pentium CPU was 40% instruction decoding by die area. And the ISA has grown enormously since then.

varelse · on July 25, 2015

"CPUs are super restricted by the single threaded, branching nature of the code you run on them, and this is what makes CPU performance a little more nuanced, and not directly comparable."

Ironically, to really hit peak performance of a modern AVX2 or later CPU, you have to embrace many of the design principles that lead to efficient GPU code:

1. Multiple threads per core to make use of the dual vector units introduced in Haswell

2. SIMD-like thinking to remap tasks into the 8-way and soon to be 16-way vector units

3. Running multiple threads across multiple cores

4. Micromanaging the L1 cache and treating the AVX/SSE registers as L0 cache

Where the CPU prevails is for fundamentally serial algorithms that cannot be mapped into a SIMD implementation. Mike Acton's Data-Oriented Design covers this case nicely IMO.