
The Past, Present, and Future of the CPU, According to Intel and AMD
http://www.gamespot.com/articles/the-past-present-and-future-of-the-cpu-according-t/1100-6421514/
======
ewzimm
I'm disappointed that the article leads with Intel's performance lead. I've
always found it to be meaningless. Their best CPU is over $2500. Their best i7
is over $1000. I use an AMD A4-3300. It's fast enough for my needs and $20. On
the other hand, I also have a Bay Trail Atom tablet, and it has fantastic
performance for its power usage. The article talks about what a failure Atom
has been. I think part of the blame is tech journalism promoting the fastest
silicon. It appeals to some psychological desire, but it doesn't make much
practical sense. Battery life on mobile and value on the desktop are all I
really care about. Why is it such an embarrassment for both AMD's desktop
division and Intel's mobile division to be the most efficient?

~~~
runeks
Agreed. When I design a new PC, I don't look for the highest performance
system, I look for performance per dollar. This chart is very useful:
[http://www.cpubenchmark.net/cpu_value_available.html](http://www.cpubenchmark.net/cpu_value_available.html)

When designing a new PC about a year ago, I went with an eight-core AMD
FX-8320, which was significantly cheaper than an equally fast Intel i5 or i7
CPU (although its per-thread performance is only half that of a 4-core i5/i7
with the same Passmark score).
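
If you want to do that ranking yourself, the arithmetic is just score divided
by price. A minimal sketch with made-up placeholder numbers (not current
scores or prices):

    // Rank CPUs by Passmark points per dollar. The figures below are
    // placeholders for illustration only.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Cpu { const char* name; double passmark; double price_usd; };

    int main() {
        std::vector<Cpu> cpus = {
            {"AMD FX-8320",    8000, 140},
            {"Intel i5-4670",  7700, 220},
            {"Intel i7-4770", 10000, 310},
        };
        std::sort(cpus.begin(), cpus.end(), [](const Cpu& a, const Cpu& b) {
            return a.passmark / a.price_usd > b.passmark / b.price_usd;
        });
        for (const auto& c : cpus)
            std::printf("%-14s %6.1f points per dollar\n",
                        c.name, c.passmark / c.price_usd);
    }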

------
Handwash
"If Epic and its Unreal engine on console don't have a threaded graphics
pipeline--which to date they don't"

I'm surprised that this is still the case. I thought everyone had already
optimized their engines for multi-threading.

~~~
m0th87
Current game engines are still bound by the single-threaded game loop model,
although not by necessity. Here's a really cool presentation about it:
[https://www.st.cs.uni-saarland.de/edu/seminare/2005/advanced...](https://www.st.cs.uni-saarland.de/edu/seminare/2005/advanced-fp/docs/sweeny.pdf)
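
For anyone unfamiliar with the term, here's a minimal sketch of the
single-threaded game loop model being referred to (the names are illustrative,
not from any particular engine):

    // Classic single-threaded game loop: simulation and rendering run serially
    // in one thread each frame, so extra cores mostly sit idle.
    #include <chrono>

    struct World    { void update(double /*dt*/) { /* game logic, physics */ } };
    struct Renderer { void draw(const World&)    { /* build + submit draw calls */ } };

    int main() {
        World world;
        Renderer renderer;
        auto prev = std::chrono::steady_clock::now();
        for (int frame = 0; frame < 3; ++frame) {   // stands in for "while (running)"
            auto now = std::chrono::steady_clock::now();
            double dt = std::chrono::duration<double>(now - prev).count();
            prev = now;
            world.update(dt);       // simulation step
            renderer.draw(world);   // rendering step, same thread
        }
    }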

------
yazaddaruvala
I'm curious... let me know, good or bad (or wrong), what you guys think of this.

Quick background. My understanding of how a mobile phone works: There is a
primary CPU, running Android, which does the majority of the work on the
phone. Meanwhile, there is a second CPU attached to the radio running an RTOS.
This RTOS interprets the signals from the antenna and makes nice packets for
Android. The RTOS and secondary processing unit can then be optimized for just
that one task, and Android can be optimized for reading, writing and
processing data packets.

Similarly, what if we made more parts of our infrastructure "smarter"? Take
the monitor, for example. Currently it has a frame buffer: we fill it over a
DVI cable and the lights change. Is there some abstraction at which we could
make a monitor work "smarter"? Can we put a GPU in the monitor, and then just
move memory and issue OpenGL commands? Is that a better abstraction for
monitors? Can we also put a CPU in the monitor and embed an RTOS/rendering
engine? Would that make a better abstraction? Could we then optimize the
"smarter" monitor and its integration with the game logic better than we could
if the monitor just had a frame buffer?

I don't have much domain knowledge and so I don't know what the correct
abstraction for a "smarter" monitor should be. However, I do think it's a good
question to be asking (and not just about monitors). I'm curious if you guys
can think of such an abstraction. What would that be?

~~~
pandaman
GPUs are already pretty smart and are controlled by their own processor(s).
Moving them into the monitor is not going to make them any smarter. On the
other hand, communication with a GPU hooked up over a flexible cable is going
to be more complicated than with a GPU sitting on a wide internal bus or, as
is becoming more common, on the same chip as the CPU.

------
kens
I'd like a better understanding of modern processor microarchitectures, i.e.
what's happening inside the chip. What do you recommend I read? I'd like to be
able to understand a diagram like: [http://www.realworldtech.com/wp-content/uploads/2012/10/hasw...](http://www.realworldtech.com/wp-content/uploads/2012/10/haswell-5.png?71da3d)

~~~
Dwolb
I'd recommend this course:
[https://www.coursera.org/course/comparch](https://www.coursera.org/course/comparch)

If you're too advanced for this, I'd consider writing some behavioral VHDL or
Verilog for a few of the units to see how the pieces fit together.

------
KerrickStaley
Mantle still seems somewhat far out—the article mentions that the spec won't
be published until the end of this year, and there's no word on when Linux
support will come (though AMD has said it will happen).

~~~
jeffreyrogers
Do you know of a good explanation of Mantle's goals? What sets it apart from
DirectX or OpenGL?

~~~
corysama
OpenGL and D3D are both very good APIs that have worked well for 20+
years. However, they have a few ideas about the hardware fundamentally baked
into them that are not aging well.

The main issue is that they are based on a model of continuously modifying a
very large, monolithic body of state representing fine details about what the
next draw should do. At any moment a draw call may be issued to enact the
current state and produce a result.

In the past, that state was represented in hardware mostly using a large
collection of physical registers. Nothing else could possibly be fast enough.
The API model of "set BlendStateSourceOp, set BlendStateDestOp, etc." mapped
very well to the hardware. You literally were continuously mutating a large
block of registers.
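
To make that concrete, here's a minimal sketch of the classic mutate-then-draw
style in OpenGL. It assumes a valid GL context with shaders and vertex data
already bound (all of that setup is omitted):

    // Classic OpenGL: poke individual pieces of global state, then draw.
    // Each call below conceptually flips a register-like bit of draw state.
    #include <GL/gl.h>

    void draw_transparent_batch(int first_vertex, int vertex_count) {
        glEnable(GL_BLEND);                                  // blending on
        glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);   // source/dest blend ops
        glDepthMask(GL_FALSE);                               // don't write depth
        glDrawArrays(GL_TRIANGLES, first_vertex, vertex_count); // enact current state
        glDepthMask(GL_TRUE);                                // restore for later draws
        glDisable(GL_BLEND);
    }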

In the present, programmable hardware has become capable of largely taking
over for fixed-function hardware. Modern GPUs have been steadily cutting out
special-purpose silicon to make room for more multi-purpose ALUs. On this
hardware, the state describing how to draw lives in fairly large, allocated
structures instead of single-purpose registers. These structures are not
trivial to construct, and modifying them continuously is not advised. However,
switching between them is as trivial as moving a pointer from one to the
other.

Fortunately, most games don't actually use a continuum of states when drawing.
In practice, they switch repeatedly between a small number of states with very
little variation between frames. Therefore, modern drivers do a lot of work to
implicitly infer which state setups are heavily repeated within each run of
each application. These states are baked into structures under the hood, on
the fly. Odd variants are expensive in this model. But they are also rare, so
they are a lower priority.

Mantle, Metal and DX12 all seek to reboot the idea of graphics APIs from
scratch based on how hardware actually works today. You set up an explicit set
of draw state structures at init time. You switch between them explicitly and
trivially at run time.
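
To show the shape of that model, here's a toy sketch. It is not any real API,
just an illustration of the idea that the expensive work happens when a state
object is built, and a draw only swaps a pointer:

    // Toy model of the Mantle/Metal/DX12 style: bake immutable pipeline-state
    // objects once at init time; switching between them later is trivial.
    #include <memory>
    #include <string>

    struct PipelineState {
        std::string compiled_shaders;   // stands in for compiled, validated GPU state
        bool        blending_enabled;
    };

    // Expensive: compile and validate once, up front.
    std::shared_ptr<const PipelineState> create_pipeline(bool blending) {
        return std::make_shared<PipelineState>(
            PipelineState{"<compiled code>", blending});
    }

    struct CommandList {
        const PipelineState* current = nullptr;
        void set_pipeline(const PipelineState& ps) { current = &ps; } // pointer swap
        void draw(int /*vertex_count*/) { /* record a draw using *current */ }
    };

    int main() {
        auto opaque      = create_pipeline(false);   // init time
        auto transparent = create_pipeline(true);

        CommandList cmd;                             // run time, per frame
        cmd.set_pipeline(*opaque);      cmd.draw(3000);
        cmd.set_pipeline(*transparent); cmd.draw(600);
    }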

A second issue baked into OGL/D3D is that, in the past, the monolithic draw
state was stratified into quite nicely orthogonal chunks dealing with separate
issues such as: how to load a vertex from memory vs. how to operate on a
vertex vs. how to pass data from the vertex shader to the fragment shader vs.
how to operate on a fragment (sample) vs. how to blend the fragment into the
framebuffer. This model made the APIs quite nice to learn and to use.

Unfortunately, it is simply not representative of how the hardware actually
operates today. Today, most of those operations are actually handled by
general purpose ALUs. These ALUs are running the vertex and fragment programs
you wrote. But, they are also running more code to handle what used to be done
in fixed-function silicon. Actually, it's worse than that. What used to be a
register flip that was completely orthogonal to your vertex/fragment programs
is now actually implemented by modifying code interleaved into the guts of the
programs you compiled back at init time. These changes are done under the hood
and on the fly.

Modifying the code under the hood is expensive. Worse, the draw state is so
large and complicated that it is easy to accidentally request an invalid
state. Validating each given state is expensive. Because the classic model
lets you make draw state changes at any time preceding a draw, and the state
changes are no longer stratified, the state validation can no longer be done
incrementally. Instead, every time you draw, a significant amount of work is
done just to make sure the request makes sense.

Again, by declaring draw states up front, compilation and validation can be
done once, at init time. Switching between pre-compiled, pre-validated states
is trivial.

A third issue is that OGL/D3D have the genuinely great goal of preventing
and/or detecting synchronization errors in the usage of the API. In other
words, you really shouldn't have the CPU modify a given block of memory while
the GPU is simultaneously reading that same memory in an uncoordinated
fashion. OGL and D3D have an interface and implementation designed to
prevent/detect/allow-at-a-huge-cost these usage errors as much as possible. In
practice, serious programs cannot ship with these errors. That means that, in
practice, all serious, shipping programs do not have these errors to any
significant degree, but the driver is still always doing a large amount of
work checking for them.

The new-style APIs seem more inclined to declare this category of usage errors
to be undefined behavior rather than pay the cost to handle them. "Here's how
to avoid them. So... avoid them."

A fourth issue is that multi-core computing is much more common and important
than it was in the past. OpenGL has never had an interface to issue draw
commands from multiple threads of a single process. D3D11 had an interface to
record commands on multiple threads and dispatch them on a primary thread, but
the consensus is that D3D11's implementation did not work as well in practice
as was expected.

Mantle, Metal and DX12 all have new, multi-threaded interfaces that they are
quite confident will work well in practice.
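
As a toy sketch of that interface shape (again, not any real API): each worker
thread records into its own private command list, and a single thread submits
the lists in a fixed order afterwards.

    // Toy model of multi-threaded command recording: N threads each fill a
    // private command list; one thread then submits them all, in order.
    #include <cstdio>
    #include <functional>
    #include <string>
    #include <thread>
    #include <vector>

    using CommandList = std::vector<std::string>;   // stands in for recorded GPU commands

    void record_chunk(int chunk, CommandList& out) {
        out.push_back("set pipeline for chunk " + std::to_string(chunk));
        out.push_back("draw chunk " + std::to_string(chunk));
    }

    int main() {
        const int num_chunks = 4;
        std::vector<CommandList> lists(num_chunks);
        std::vector<std::thread> workers;

        for (int i = 0; i < num_chunks; ++i)          // record in parallel
            workers.emplace_back(record_chunk, i, std::ref(lists[i]));
        for (auto& t : workers) t.join();

        for (const auto& list : lists)                // submit serially, in order
            for (const auto& cmd : list)
                std::printf("submit: %s\n", cmd.c_str());
    }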

Much of what I'm describing here is covered in this presentation from
Microsoft "DirectX 12 API Preview"
[https://www.youtube.com/watch?v=m0QkjKGZQzI](https://www.youtube.com/watch?v=m0QkjKGZQzI)

An alternative approach has been proposed by a multi-vendor group of OpenGL
driver developers. It was presented in the "Approaching Zero Driver Overhead"
(AZDO) talk at GDC 2014.
[http://gdcvault.com/play/1020791/](http://gdcvault.com/play/1020791/) and
[https://www.khronos.org/assets/uploads/developers/library/20...](https://www.khronos.org/assets/uploads/developers/library/2014-gdc/Khronos-OpenGL-Efficiency-GDC-Mar14.pdf)

In the AZDO approach, instead of tossing out OpenGL's legacy state machine,
they demonstrate how some current (fairly cutting-edge) features that have
recently been added allow a draw state to be set up that is so expressive and
so extensive that it can quite effectively represent a whole, fairly
complicated scene from a modern game in a single draw state. Once you set this
up, you can pretty much issue a single request to draw much, if not all, of
the current frame as an atomic operation. Further, common frame-to-frame
modifications (such as moving objects around) are very cheap in this setup.
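
For a flavor of what that looks like in code: one of the pieces the AZDO talk
leans on is multi-draw indirect (core in OpenGL 4.3), where per-object draw
parameters live in a GPU buffer and a large chunk of the scene goes out in one
call. A hedged fragment, with the buffer, shader and vertex-array setup
omitted:

    // AZDO-style submission (OpenGL 4.3+): one glMultiDrawElementsIndirect call
    // stands in for thousands of individual draw calls. Assumes a 4.3 context
    // and a function loader (glad, GLEW, ...) have already been set up.
    #include <glad/glad.h>   // loader choice is illustrative; any GL 4.3 loader works

    void submit_scene(GLuint indirect_buffer, GLsizei draw_count) {
        // Per-draw parameters (index counts, offsets, base instances) live in
        // this buffer, typically written through a persistently mapped pointer.
        glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirect_buffer);
        glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                                    nullptr,     // commands come from the bound buffer
                                    draw_count,  // number of packed draw commands
                                    0);          // stride 0 = tightly packed
    }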

AZDO is an interesting and perfectly workable approach. I am less of a fan of
it than I am of the DX12 approach.

I should make this into a blog post... I should start a blog...

~~~
przemo_li
AZDO is not about "a single draw per frame", nor "a single draw per scene".

It's "a single draw per the timeframe needed to switch state".

The difference being that modern GPUs can "hide" a state change behind a big
enough workload.

Also, OpenGL as it is right now allows for explicit GPU/CPU synchronization
and multi-threaded content creation (without needing any explicit API for it).

What it lacks are:

* A requirement for caching shaders. (Quality of implementation)
* A requirement for offline shader compilation. (Quality of implementation)
* An intermediate representation, so that mundane tasks like dead-code
elimination happen ahead of time. (Specification)
* Having all that good stuff in core. (Specification)
* App devs moving to the Core Profile. (That's us...)

So OpenGL as it is now is quite close to solving all your problems. And it
does it in a somewhat-less, somewhat-more explicit way than DX12/Mantle (those
focus on exposing CPU/GPU-intensive operations, while OpenGL "AZDO" goes ahead
and proposes solutions to GPU/CPU bottlenecks).

~~~
przemo_li
In that sense, you should add a disclaimer at the beginning of this blog post
;) that by "OpenGL" you mean both OpenGL without the AZDO extensions and
OpenGL ES.

------
zanny
> could run 64-bit operating systems, which could address more than 4GB of RAM

I just want to vent about how insane it is that the x86 ecosystem moved to 64
bit because _Microsoft refused to support physical address extensions in XP_.
It was completely artificial; it was not a technical limitation. Hell, we
really have no excuse to be on 64 bit even today. The improved integer and
floating-point precision is nice, but few people cared about that. With PAE,
you are only limited by the per-application address space... which you are
_still_ limited by today if you are running a 32-bit program. And in the
Windows ecosystem, since there is so much less free software and no package
management, everything is distributed as a 32-bit binary to be compatible
across the board.

Which means that, in practice, nobody ever needed 64 bit, and those that did
had business reasons to do so. It is _true_ that Itanium failed because Intel
was ignorant of the massive inertial resistance to bringing tools across
architectures. There is a reason OS/2 still exists, and that is just cross-OS;
you can at least keep using the same ASM. And you would be surprised how many
enterprise programs are ASM-bound somewhere, because some unholy desecration
of coding practices goes on in the bowels of that non-source-controlled shared
repository of suffering.

But I'm talking about the consumer space here. If AMD64 had taken off as the
Xeon class of 2004, that would have been fine, but we made this transition for
pretty much _no reason at all_. I have no idea why Microsoft thought making
Windows 64 bit for the Vista release (there was a 64-bit XP, but it was really
rare and business focused) was easier than just supporting PAE.

But that is happening _again_ today. Kind of in reverse, though. Or maybe it
is just a precedent that was set? Apple has now opened the floodgates for
ARMv8, and now everyone and their mother wants 64-bit buzzwords on their
products, while their phones still ship with 3GB or less of memory and their
architecture _again_ supports PAE without issue.

And there are performance ramifications here. You can fit less into your cache
lines when every address takes up twice the space. There is a reason to try to
fit your data into the smallest format possible. You end up with more pages in
aggregate from all the wasted space, and thus implicitly consume more memory.
And we never even got real 64 bit: x86 chips are still physically 48 bit,
because somewhere along the line it was realized that "64 bits is a ludicrous
amount of memory".

I think the best part is that memory limitations are also solvable problems.
If your program hits its memory limit (i.e., a 32-bit binary with 4GB
addressing), you can just fork a process instead of a thread, and suddenly you
double your working space. It is insanely rare to have an active working set
of more than 4GB where that data needs to be locally addressable or else you
suffer huge slowdowns, and even more so in routines where you cannot
load-balance by delegating processes that manage ranges of values that get
that big.
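
A minimal POSIX sketch of that fork-a-process idea (my own illustration; the
coordination between processes, e.g. pipes or shared memory, is omitted):

    // Each forked child gets its own private address space, so a 32-bit parent
    // can spread a large working set over several separate 4GB spaces.
    #include <cstdio>
    #include <sys/wait.h>
    #include <unistd.h>

    int main() {
        const int workers = 2;
        for (int i = 0; i < workers; ++i) {
            pid_t pid = fork();
            if (pid == 0) {                 // child: own address space
                std::printf("worker %d: process my slice of the data\n", i);
                _exit(0);
            }
        }
        for (int i = 0; i < workers; ++i)
            wait(nullptr);                  // parent waits for all children
    }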

Likewise, if you hit a kernel address limit (remember, with PAE you can get
anywhere from an order-of-magnitude to a thousandfold increase in available
physical memory), your problem is so huge that it makes sense to have it
worked on by a server-farm compute cluster. Unless you tried to voodoo all
that circuitry together into some supermassive NUMA system (sounds painful),
which wouldn't make any sense anyway, because trying to abstract away the
network interconnects between disparate nodes at that scale and treat them
like memory introduces so much latency, when the link is as slow as flash
storage, that the whole exercise is redundant.

....

Ok, tangential rant over. I just think 64 bit is such a stupid, buzzword-driven
waste of time, and it's crazy that the industry has gone off the deep end on
it for so long because it exploits some cultural tic in people that bigger is
better. No, in practice it really does not "hurt" to have, but we (at least
the consumer class, and 99% of business use cases) never needed it in the
first place, really.

~~~
joenathan
>Microsoft refused to support physical address extensions in XP

Completely wrong

"The original releases of Windows XP and Windows XP SP1 used PAE mode to allow
RAM to extend beyond the 4 GB address limit. However, it led to compatibility
problems with 3rd party drivers which led Microsoft to remove this capability
in Windows XP Service Pack 2. Windows XP SP2 and later, by default, on
processors with the no-execute (NX) or execute-disable (XD) feature, runs in
PAE mode in order to allow NX.[14] The no execute (NX, or XD for execution
disable) bit resides in bit 63 of the page table entry and, without PAE, page
table entries on 32-bit systems have only 32 bits; therefore PAE mode is
required in order to exploit the NX feature. However, "client" versions of
32-bit Windows (Windows XP SP2 and later, Windows Vista, Windows 7) limit
physical address space to the first 4 GB for driver compatibility via the
licensing limitation mechanism, even though these versions do run in PAE mode
if NX support is enabled.

Windows 8 will only run on processors which support PAE, in addition to NX and
SSE2."

- [http://en.wikipedia.org/wiki/Physical_Address_Extension#Micr...](http://en.wikipedia.org/wiki/Physical_Address_Extension#Microsoft_Windows)

~~~
zanny
The driver instability was their fault alone. The fact is that the reason
AMD64 "won" was that in the late-2004 era, with SP2, Microsoft killed PAE (the
concept, not the implementation, by artificially limiting addressable memory),
and the artificial memory limit on 32-bit systems doomed the architecture.

I speak only of the period from 2004 to 2008, when we went from the Pentium 4
to the Core 2 (and from the Athlon 64 to the Phenom), and 64 bit became
ubiquitous because Microsoft failed to give its consumer-grade OSes the
ability to access the amounts of memory the hardware supported.

~~~
joenathan
Everyone crapped on Microsoft because there were no x64 drivers for Vista, and
the 3rd-party manufacturers didn't want to create new drivers for their old
hardware.

It's Microsoft's fault that 3rd-party driver support sucks? So they get blamed
when they keep compatibility, as in the case of PAE on x86, and blamed when
they break it.

------
chucknelson
First thought: "This is a Gamespot article?" Seems like something you'd see at
Anandtech.

~~~
KerrickStaley
Except that it's covering topics that were already well-known in the tech
community 6 months ago.

------
jeffreyrogers
The idea that Moore's law is still continuing is a bit misleading. For all
practical purposes it's over. See the graph here:
[http://www.extremetech.com/computing/116561-the-death-of-cpu...](http://www.extremetech.com/computing/116561-the-death-of-cpu-scaling-from-one-core-to-many-and-why-were-still-stuck)

Yes, we can still cram more transistors onto a chip, but we can't get the
clock speed any higher because we run into power/heat limitations. And if you
look at the graph in the link I posted, you'll see that clock speed peaked
around 2005. Around that time AMD and Intel focused more on multiple cores,
and clock speed became less important.

Tangent: Personally, I like ARM much better than x86/x86-64. RISC based
architectures just seem so much nicer. Plus, they are easier to optimize for
and more predictable in terms of clock cycles per instruction.

~~~
m0th87
Moore's law is about the number of transistors, not clock speed.

"The number of transistors incorporated in a chip will approximately double
every 24 months." \- Gordon Moore:
[http://www.intel.com/content/www/us/en/history/museum-
gordon...](http://www.intel.com/content/www/us/en/history/museum-gordon-moore-
law.html)

~~~
jeffreyrogers
That's true, but the reason people care about it is that, in the past, more
transistors led to higher clock speeds. That's what I was getting at by saying
we _can_ still put more transistors on a chip, but power and heat limitations
prevent that from turning into increased clock speeds.

~~~
m0th87
Yeah. I'd be curious to know how much of that is a product of the
software/hardware architectures we currently use, e.g. programming languages
generally not built from the start with parallelism in mind.

~~~
m_mueller
Look at it this way:

1. Most software today is I/O bound, which means programmers don't (and
shouldn't) care about shared-memory parallelism.

2. Most popular programming languages today are based on classic imperative
programming, going back to Fortran and co.

3. These classic languages encourage using loops.

4. Loops are inherently not parallelizable. Only in the specific case where
there is no carried dependency does a loop become parallelizable (see the
sketch after this list).

5. These languages have basically infested everything we do, including
compute-bound and memory-bandwidth-bound problems that _should_ now be treated
in parallel. (Even bandwidth goes down with sequential execution, on Intel
sockets usually by a factor of 2.)

6. Since most parallel things therefore get written as loops, this becomes a
hard problem. What the compiler vendors are doing is usually flinging
[directives](http://www.openacc.org/) [at us](http://openmp.org/wp/).

7. Directives work well until they don't, and then you have no idea why,
because it's usually a black box for most programmers (who can't read
assembly-like code).

8. What we _should_ get is a way of saying: here is some scalar code; I'd like
this code to be applied in parallel over domains X, Y, Z, etc. Be aware of
symbols alpha and beta, which are dependent in X, Y and Z, as well as gamma,
which is dependent in X only.

9. This should be available _at the language level_, so programmers start
thinking in these terms. Only then will we have a reliable way of making use
of data parallelism.

10. CUDA and OpenCL are actually pretty close to this, but they are slightly
too low level and generally thought of as being hard to program in (which I
don't agree with, but that's the image).
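
As a concrete illustration of points 4 and 6, here's a minimal sketch (my own
example): the first loop has no carried dependency and can be split across
cores with an OpenMP directive; the second carries a dependency from one
iteration to the next, so it cannot be parallelized as written.

    // Compile with e.g.: g++ -fopenmp loops.cpp
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

        // No carried dependency: every iteration is independent, so a simple
        // directive lets the compiler split the iterations across cores.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];

        // Carried dependency: iteration i needs the result of iteration i-1,
        // so this loop stays sequential as written.
        std::vector<double> prefix(n);
        prefix[0] = c[0];
        for (int i = 1; i < n; ++i)
            prefix[i] = prefix[i - 1] + c[i];

        std::printf("%f %f\n", c[n - 1], prefix[n - 1]);
    }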

Disclaimer: I've been involved in this problem space for some time, and
[this](https://github.com/muellermichel/Hybrid-Fortran) is what has come out
of it. It's HPC targeted, but at some point I'd like to make this whole
parallel-computing thing more generally approachable.

