
What's New in CPUs Since the 80s and How Does It Affect Programmers?
http://danluu.com/new-cpu-features/
======
bluetomcat
The constant widening of the SIMD registers is an interesting trend to watch.
MMX started off as 64-bit, and now AVX-512 is 8 times that. With many cores
and a couple of SIMD units in each core, aren't CPUs becoming ever more
suitable to handle graphics workloads that were once reasonable only for GPUs?

~~~
Symmetry
Everything vardump said is true, but there's also the consideration of what
isn't there as well as what is. By not having an out-of-order execution setup
or a broad forwarding network, a GPU core can be much smaller than a CPU core
of equivalent throughput. By "core" here I mean something capable of
independently issuing instructions and executing them,
something capable of independently issuing instructions and executing them,
not calling each execution unit a "core" the way Nvidia does (though with SIMT
that isn't as crazy as it might look at first). That means you can pack more
cores on a chip and use less power for a given amount of work, since power
consumption is roughly proportional to the number of transistors used. If we
eventually start doing graphics on CPUs, it will be because we move to ray
tracing or other sorts of algorithms that GPUs are bad at.

~~~
Symmetry
One more thing. GPUs don't have to worry about precise interrupts, like when
you try to load a piece of memory only to find that it has to be paged in. If
they're in a memory-protected setup where that's even possible, it'll be the
CPU that has to deal with the mess; the GPU can just halt while that's taken
care of.

This makes SIMT easy. SIMT is "Single Instruction, Multiple Threads", which is
sort of like SIMD but better. You have an instruction come in which is
distributed to multiple execution lanes, each of which has a hopper of
instructions it can draw from. Each lane then executes those instructions
independently, and if a lane notices that an instruction has been predicated
out it just ignores it. Or maybe if one lane is behind, it can give the
instruction to a neighbor which isn't. The fact that you don't have to be able
to drop everything in a single cycle and pretend you were executing perfectly
in order gives the hardware a lot of flexibility, and all of this complexity
is only O(n) with the number of execution lanes instead of O(n^2) as with a
typical OoO setup. The need to have precise exceptions involving SIMD
instructions means that this isn't a simple thing for a regular CPU core to
add to its SIMD units.

The fact that each lane makes decisions about when to execute the instructions
it has been issued is why some people refer to the lanes as "cores". I don't,
because what they're doing isn't any more complicated than what the 8
reservation stations in a typical Tomasulo-algorithm OoO CPU core would be
doing, even if they are smarter than a SIMD lane. With GPUs it makes more
sense to break down "cores" by instruction issue, in which case a high end GPU
would have dozens rather than thousands of cores.

~~~
vardump
I think GPUs do have faulting mechanisms, if that's what you meant by "precise
interrupts". How else would they bus-master page memory over PCIe?

> Each lane then executes those instructions independently and if the lane
> notices that the instruction has been predicated out it just ignores it.

This is exactly what you do on CPUs as well. In SSE/AVX you often mask
unwanted results instead of branching. Just like on GPUs. AVX has 8 lanes, 16
with Skylake.
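
A minimal sketch of that masking idiom with SSE intrinsics (the function name
here is made up, and a real kernel would do more work per lane):

    #include <xmmintrin.h>

    /* Double every positive float in a 4-wide vector, leaving other lanes
     * untouched, with no per-element branch. The comparison yields an
     * all-ones/all-zeros mask per lane, which blends the two candidates. */
    __m128 double_positives(__m128 a)
    {
        __m128 zero    = _mm_setzero_ps();
        __m128 mask    = _mm_cmpgt_ps(a, zero);  /* per lane: a > 0 ? ~0 : 0 */
        __m128 doubled = _mm_add_ps(a, a);       /* result for "true" lanes */
        /* blend: (doubled & mask) | (a & ~mask) */
        return _mm_or_ps(_mm_and_ps(mask, doubled),
                         _mm_andnot_ps(mask, a));
    }

(SSE4.1's _mm_blendv_ps can do that final blend in a single instruction.)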

~~~
Symmetry
Regarding faulting mechanisms, if you've got a discrete GPU on a PCI bus then
it's a separate piece of silicon that handles the network protocol. The
important point is that I don't believe that the GPU cores have to be able to
halt execution and save their state at any instruction.

It's certainly true that SIMD instructions in CPUs have predication which
saves you a lot of trouble. The difference is that if you have two
instructions which are predicated in a disjoint way you can execute them both
in the same cycle in a SIMT machine but you would have to spend one cycle for
each instruction in a SIMD machine. You can look at Dylan16807's link for all
the details.

~~~
tmurray
GPUs don't support precise exceptions. For example, you can't take a GPU
program that contains a segfault, run it as a standard program (as in, not in
a debug mode), and be presented with the exact instruction that generated the
fault.

------
pedrocr
> _If we set _foo to 0 and have two threads that both execute incl (_foo)
> 10000 times each, incrementing the same location with a single instruction
> 20000 times, is guaranteed not to exceed 20000, but it could (theoretically)
> be as low as 2._

Initially it seemed to me the theoretical minimum should be 10000 (as the
practical minimum seems to be). But it does indeed seem possible to get 2:

1) Both threads load 0

2) Thread 1 increments 9999 times and stores

3) Thread 2 increments and stores 1

4) Both threads load 1

5) Thread 2 increments 9999 times and stores

6) Thread 1 increments and stores 2

Is there anything in the x86 coherency setup that would disallow this and make
10000 the actual minimum?

~~~
logfromblammo
I think not. The only limit seems to be the number of times one CPU can
increment while the other is between load and store operations.

If you could manipulate your two cores with enough dexterity, you could force
a 2 result, but without enough fine-grained control, the practical limit is
going to be determined by the number of increments that can be "wasted" by
sandwiching them between the load and store operations of the other thread.
You would essentially need to stop and start each CPU at four very specific
operations.

The practical results show that "load + add + store + load + add + store" on
one thread probably never happens during a single "add" on the other thread.
You would need that to happen at least once to get below 10000. Otherwise,
each increment can waste no more than one increment for the other thread, and
you end up with at least 10000.

The experimental numbers are probably indicative of how long the add portion
of INCL takes in relation to the whole thing.

------
acallan
The author is incorrect in the section about memory fences. x86 has strong
memory ordering [1], which means that writes always appear in program order
with respect to other cores. Use a memory fence to guarantee that reads and
writes are memory bus visible.

The example that the author gives does not apply to x86.

[1] There are memory types that do not have strong memory ordering, and if you
use non-temporal instructions for streaming SIMD, SFENCE/LFENCE/MFENCE are
useful.
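
As a rough sketch of the non-temporal case in [1] (names are placeholders;
this assumes dst is 16-byte aligned and len_vecs counts 16-byte chunks):

    #include <stddef.h>
    #include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_sfence */

    /* Copy a buffer with non-temporal stores, which bypass the cache and are
     * not covered by x86's usual strong store ordering, then issue an sfence
     * before publishing a "ready" flag so anyone who sees the flag also sees
     * the streamed data. */
    void stream_copy_and_publish(__m128i *dst, const __m128i *src,
                                 size_t len_vecs, volatile int *ready)
    {
        for (size_t i = 0; i < len_vecs; i++)
            _mm_stream_si128(&dst[i], _mm_load_si128(&src[i]));

        _mm_sfence();  /* order the non-temporal stores before the flag write */
        *ready = 1;
    }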

~~~
martincmartin
The author's point still holds if writes might not appear in program order on
other CPU sockets, not just other cores within the same socket.

Do you have a reference for the strong memory ordering on x86? I'd like to
read more about it.

~~~
rayiner
x86 enforces (essentially) total store order across all sockets:
[http://www.cl.cam.ac.uk/~pes20/weakmemory/index3.html](http://www.cl.cam.ac.uk/~pes20/weakmemory/index3.html).
The barriers are still useful for kernel code because other processors on the
machine usually don't participate in the cache-coherency protocol.

------
gshrikant
For a somewhat dated introductory/high-level survey of the state of CPU design
up to 2001, I find [1] quite informative. Of course, having been written in
the frequency-scaling era, some of its 'predictions' are way off the mark.
Nevertheless, I find it a good resource for someone looking to get a
bird's-eye view of 30+ years of development in the field.

[1]
[http://www.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=0...](http://www.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=00964437.pdf)

------
antiuniverse
If you enjoyed this article, you might also want to check out Agner Fog's
optimization manuals and blog (which I decided were probably worth a separate
submission):
[https://news.ycombinator.com/item?id=8874206](https://news.ycombinator.com/item?id=8874206)

------
pjmlp
All features that make C look high level nowadays, contrary to what many
think.

None of them are exposed in ANSI C.

~~~
bhouston
Well, you can still use them in C/C++ code via:

#include <mmintrin.h>  // Intel MMX
#include <xmmintrin.h> // Intel SSE
#include <emmintrin.h> // Intel SSE2
#include <pmmintrin.h> // Intel SSE3
#include <tmmintrin.h> // Intel SSSE3
#include <smmintrin.h> // Intel SSE4.1
#include <nmmintrin.h> // Intel SSE4.2
#include <ammintrin.h> // Intel SSE4A
#include <wmmintrin.h> // Intel AES
#include <immintrin.h> // Intel AVX
#include <zmmintrin.h> // Intel AVX-512
#include <arm_neon.h>  // ARM NEON
#include <mmintrin.h>  // ARM WMMX

And a full list of what is possible in Microsoft Visual C:
[http://msdn.microsoft.com/en-us/library/hh977022.aspx](http://msdn.microsoft.com/en-us/library/hh977022.aspx)

Now the reason why they are not in ANSI C is that low-level CPU features (just
like raw instructions) are not portable by nature.

~~~
pjmlp
Where are the headers for out-of-order execution, cache levels, execution
pipelines, ...?

Additionally, not all C compilers provide such headers.

~~~
bhouston
There are a lot of prefetch instructions and write-barriers in the available
intrinsics:

[http://msdn.microsoft.com/en-us/library/hh977022.aspx](http://msdn.microsoft.com/en-us/library/hh977022.aspx)

I do not think one has full control over OOO execution on x86-64 processors.
Also I do not believe one has control over the execution pipeline even in
assembly, although I do not know exactly what you mean by that, so it could
just be a misunderstanding.
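
For example, a prefetch hint through the intrinsics looks roughly like this
(the 16-element lookahead is a made-up distance that would need tuning, and
the hint may simply be ignored):

    #include <stddef.h>
    #include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

    /* Sum a large array while hinting the cache to pull in data a few
     * iterations ahead. The core's own prefetchers and OoO machinery stay
     * outside the programmer's control; this is only a hint. */
    float sum_with_prefetch(const float *data, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                _mm_prefetch((const char *)&data[i + 16], _MM_HINT_T0);
            sum += data[i];
        }
        return sum;
    }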

~~~
pjmlp
My whole point is that this isn't C any longer, it is Assembly.

~~~
seunosewa
But assembly doesn't have instructions for "out-of-order execution, cache
levels, execution pipelines, etc"

~~~
pjmlp
That is what Assembly data sheets provide.

------
martincmartin
In the section on rdtsc, the author should really mention that there's a
newer, serializing version called rdtscp, which should generally be preferred.

~~~
uxcn
rdtscp also consumes an extra register though.
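
A rough sketch of using it through the compiler intrinsics (x86intrin.h is the
GCC/Clang header; MSVC exposes __rdtsc/__rdtscp via intrin.h):

    #include <stdint.h>
    #include <x86intrin.h>  /* __rdtsc, __rdtscp */

    /* Time a region of code. rdtscp waits for earlier instructions to finish
     * before reading the counter, and also writes the processor/TSC_AUX id
     * into the extra output operand (the register cost mentioned above). */
    uint64_t time_region(void (*work)(void))
    {
        unsigned aux;
        uint64_t start = __rdtsc();
        work();
        return __rdtscp(&aux) - start;
    }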

------
majke
Another exciting thing with the new PCI Express standard is Direct Cache
Access, which is especially useful for high-speed networking:

[http://web.stanford.edu/group/comparch/papers/huggahalli05.p...](http://web.stanford.edu/group/comparch/papers/huggahalli05.pdf)

~~~
donavanm
Speaking of which, I think we'll see NVMe being the new hotness this year.
NVMe is going to bring crazy parallelism and queue depth to block devices.
Finally we'll be able to exploit flash's inherent parallel access in a sane
way.

[http://en.m.wikipedia.org/wiki/NVM_Express](http://en.m.wikipedia.org/wiki/NVM_Express)

~~~
ksk
AnandTech already did a review:

[http://www.anandtech.com/show/8147/the-intel-ssd-dc-p3700-re...](http://www.anandtech.com/show/8147/the-intel-ssd-dc-p3700-review-part-2-nvme-on-client-workloads/2)

------
avian
> If we set _foo to 0 and have two threads that both execute incl (_foo) 10000
> times each, incrementing the same location with a single instruction 20000
> times, is guaranteed not to exceed 20000, but it could (theoretically) be as
> low as 2.

Can someone explain the scenario where this test would result in _foo = 2? The
lowest theoretical value I can understand is _foo = 10000 (all 10000 incls
from thread 1 are executed between one pair of load and store in thread 2 and
hence lost).

~~~
ColinDabritz
You're almost there! Basically that swapping you outlined occurs twice, once
on each side.

1. Both threads read '0', beginning their first incl instruction (A: 0, B: 0,
Store: 0)

2. Thread A completes execution of all BUT ONE incl, writing its second-to-last
value to the store (A: 9999, B: 0, Store: 9999)

3. Thread B increments its '0' to '1' (A: 9999, B: 1, Store: 9999)

4. Thread B writes its '1' to the store (A: 9999, B: 1, Store: 1)

5. Thread A begins its final incl and reads the '1' that thread B just stored
(A: 1, B: 1, Store: 1)

6. Thread B executes all remaining incl instructions, writing its final value
to the store (A: 1, B: 10000, Store: 10000)

7. Thread A continues its final incl and increments its '1' to '2' (A: 2,
B: 10000, Store: 10000)

8. Thread A stores its final '2' result (A: 2, B: 10000, Store: 2)

9. All instructions have executed, and the result is '2'

It is, of course, purely theoretical and extremely unlikely in real use, as
the author's graph shows, but it's one of those threading 'gotchas' that lead
to occasionally unpredictable results.
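
A minimal C sketch of the same race (strictly speaking this is a data race, so
it's a demonstration only; names are made up, and like an unlocked incl, the
increment's load, add, and store are not atomic as a unit):

    #include <pthread.h>
    #include <stdio.h>

    static volatile int foo = 0;

    /* Each thread does 10000 non-atomic increments. Increments that land
     * between another thread's load and store are lost, so the final value
     * can be anywhere from 2 to 20000. A `lock incl` (or an atomic add)
     * would guarantee exactly 20000. */
    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 10000; i++)
            foo = foo + 1;  /* separate load, add, store */
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("foo = %d\n", foo);  /* usually near 20000, but not guaranteed */
        return 0;
    }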

------
zurn
These features debuted way before the 80s, arguably with the exception of out
of order execution. Take any feature discussed and its Wikipedia history
section will tell you about CPUs and computers in the 60s that first had them.
Caches, SIMD (called vectors back then), speculative execution, branch
prediction, virtual machines, virtual memory, accelerated IO bypassing the
CPU... all from the 60s.

~~~
pjmlp
Just like everything else in computing. We are still trying to catch up with
the Xerox PARC model of live coding, for example.

However, just a select few could touch said hardware, and you could buy a very
nice house for what they used to cost.

------
uxcn
One of the other weird things is that the general purpose chips are supposed
to be getting large chunks of memory behind the northbridge. If I'm not
mistaken, it's supposed to be used as another level in the cache hierarchy,
which could be another boon to things that leverage concurrency.

------
amelius
Nice improvements. But they seem only marginal. Yes, speed may have gone up by
a certain factor (which has remained surprisingly stable in the last decade,
or so it seems).

On the other hand, programming complexity has gone way up, while
predictability of performance has been lost.

------
cowardlydragon
Doesn't this come down to:

#1: Fit in cache.

#2: Try to multithread

All the rest is marginal.

