
Modern Microprocessors: A 90 Minute Guide - antonios
http://www.lighterra.com/papers/modernmicroprocessors/
======
djcapelis
This is a really good overview of modern processor design. I keep noticing
that a lot of people don't seem to understand the CPU designs past the five
stage pipeline taught in undergrad architecture courses and this document does
a lot of good work in targeting the areas that folks are most likely to have
misconceptions about.

I do wish they covered trace caches, or their more modern relatives, u-op
(micro-op) caches, which are back in modern Intel chips again and cause some
interesting performance artifacts. (The old trace caches of the P4 chips are
different from the u-op caches of the newer architectures, since the trace
cache actually encoded branch predictions into the cache line lookup itself,
which was always pretty wild.)

~~~
chm
What reading do you recommend for someone who wants to learn about basic CPU
design? I'm not a CS major, but I'm interested.

~~~
sliverstorm
The old standby, _Computer Architecture: A Quantitative Approach_, works you
up through the basics.

~~~
throwaway_yy2Di
Hold on, don't you mean _Computer Organization and Design_, by the same
authors? The one with these pages:

[https://books.google.com/books?id=RXARim9cNBIC&pg=PA288&dq=b...](https://books.google.com/books?id=RXARim9cNBIC&pg=PA288&dq=basic+implementation+mips+subset)

~~~
scott_s
They're different books. "Computer Architecture: A Quantitative Approach" is
an introduction to the subject for people who will work in the area. "Computer
Organization and Design" is for people who need to understand how processors
and hardware systems work in order to do their own work. (Mostly.)

The preface to "Organization and Design" says basically this. For what it's
worth, "Computer Architecture" is sitting on my shelf, and that's what I used
in grad school. But based on their preface, I may buy "Organization and
Design" because it may be a better reference for what I do day-to-day.

------
jwr
This article is spectacularly good. I wish I had this available when I started
doing assembly-level optimizations on x86 chips. This knowledge used to be
much more fragmented and difficult to learn.

VLIW could indeed be left out: you are not likely to encounter a VLIW chip,
and if you do, it will come with an (excellent) compiler that will do most of
the hard work for you.

A good followup article would be a tutorial on how to lay out your
structures/arrays in memory given your access patterns and cache architecture.
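As a tiny illustration of the kind of layout decision such a tutorial would cover (a sketch of my own, not from the article), here is the classic array-of-structs vs struct-of-arrays trade-off in C. The `particle` names and fields are hypothetical; the point is how much of each fetched cache line a loop actually uses:

```c
#include <stddef.h>

/* Array-of-structs: each particle's fields sit together, so a loop
 * that touches only x also drags y, z and mass into the cache. */
struct particle_aos { float x, y, z, mass; };

/* Struct-of-arrays: all x values are contiguous, so a loop over x
 * streams through cache lines with no wasted bytes. */
struct particles_soa {
    float *x, *y, *z, *mass;
};

/* Sum of x coordinates, AoS version: uses 4 of every 16 bytes fetched. */
float sum_x_aos(const struct particle_aos *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p[i].x;
    return s;
}

/* Same sum, SoA version: every byte of every fetched line is used. */
float sum_x_soa(const struct particles_soa *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p->x[i];
    return s;
}
```

Both return the same answer; the difference only shows up in cache traffic once `n` is large relative to the cache.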

~~~
dfox
In my experience, when you encounter VLIW cores there is almost no tooling for
them, as they tend to be application-specific DSP-ish cores. In that case
hand-optimized assembly is the way to go, since there is no budget to produce
an optimizing compiler or much of anything else.

~~~
jwr
My experience was with Texas Instruments C6000 DSPs. The compiler is excellent
and if you use it well, you rarely have to resort to assembly. Even then, you
normally write linear assembly, not parallel assembly, letting the assembler
take care of the rest.

------
willvarfar
An absolutely excellent article!

For everyone interested in the topic, you might enjoy the new Mill CPU
architecture talks [http://ootbcomp.com/docs/](http://ootbcomp.com/docs/) \-
the very next talk is streamed live today (5th Feb, 16:15 PST
[http://ootbcomp.com/topic/instruction-execution-on-the-mill-...](http://ootbcomp.com/topic/instruction-execution-on-the-mill-cpu-talk-feb-5-2014/) )

(I am a Mill forum mod; ask me anything about the Mill ;)

~~~
fmstephe
What market is the mill aiming for? Servers, phones, desktop or all of the
above?

~~~
willvarfar
All of the above! It's a family of processors, all compatible and running the
same binaries, but each with the belt sizes, vector sizes, functional units
and so on that suit it.

So you'll get smaller Mills where that makes sense and absolute monsters in
supercomputers, for example.

When you think about the "smaller" Mill that you might have in your phone or
tablet, though, it's a monster compared to today's desktops! Except in the
power efficiency department, that is ;)

~~~
fmstephe
Have they indicated how they plan to get adoption?

~~~
willvarfar
The hackaday interview talks through some of the options on the business side:
[http://hackaday.com/2013/11/18/interview-new-mill-cpu-
archit...](http://hackaday.com/2013/11/18/interview-new-mill-cpu-architecture-
explanation-for-humans/)

------
pjmlp
Very nicely written article.

So does ANSI/ISO C actually expose those details, as many claim, compared to
other languages?

~~~
Symmetry
No, and generally even the binary itself doesn't expose any of that, since you
want the same program to run on your in-order Atom and your out-of-order,
multithreaded Core iFoo.

~~~
wtallis
The big exceptions being SIMD and VLIW, two similar forms of explicit
parallelism. They more or less require programming language support to use
fully, and most older languages are purely scalar, though some can be used in
a manner that is fairly amenable to automatic vectorization.

~~~
Symmetry
At the ISA level you're right, those do require explicit instructions (or
bundling of instructions). At the programming language level no.

VLIWs can be programmed by normal compilers with well-understood techniques
(though variable memory latency makes static scheduling hard to do well in
practice for most workloads). You just throw your C code at the compiler and
it spits out valid binaries; this has worked for as long as there have been
VLIW machines.

The techniques for auto-parallelizing 'for' loops into SIMD instructions are a
more recent development, but they certainly exist. The Intel C Compiler is
particularly good at this, but Clang and GCC can do it too.
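For concreteness (my sketch, not part of the comment): the loop shape these compilers reliably vectorize is a fixed-stride loop with no cross-iteration dependences and no possible aliasing. Something like a saxpy:

```c
#include <stddef.h>

/* A loop in the shape auto-vectorizers like: unit stride, independent
 * iterations, and "restrict" promising the arrays don't alias.
 * Built with e.g. "gcc -O3 -fopt-info-vec" or
 * "clang -O2 -Rpass=loop-vectorize", the compiler typically reports
 * this loop as vectorized, emitting SIMD multiply-adds instead of one
 * scalar operation per element. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Drop the `restrict` qualifiers or add a call with side effects inside the loop, and the same compilers may silently fall back to scalar code, which is exactly the fragility discussed below.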

~~~
wtallis
I _did_ mention automatic vectorization. I know it exists, but it is not
perfect even in Intel's implementation. Compilers simply cannot be relied upon
to always find and take advantage of opportunities for parallelism in serial
code. Having language constructs that explicitly express that parallelism
helps ensure that the compiler at least tries to generate parallel code, where
given straight C it might just decide that the loop body is too big to bother
with or that the side effects are too complicated to prove it can be
parallelized.
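One example of such an explicit construct (my illustration, with a hypothetical `scale` function) is OpenMP's `simd` directive, where the programmer asserts that iterations are independent rather than hoping the compiler can prove it:

```c
#include <stddef.h>

/* The "#pragma omp simd" directive tells the compiler the iterations
 * are independent, so it need not prove the absence of aliasing or
 * side effects before emitting SIMD code. Built without OpenMP
 * support, the pragma is simply ignored and the loop still runs
 * correctly as scalar code. */
void scale(float *dst, const float *src, float k, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```

The semantics don't change either way; the pragma only removes the compiler's excuse for not vectorizing.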

------
gtani
Here's another good intro to CPU design, floating point math, linear algebra,
PDE solvers, etc.:
[http://www.tacc.utexas.edu/~eijkhout/Articles/EijkhoutIntroT...](http://www.tacc.utexas.edu/~eijkhout/Articles/EijkhoutIntroToHPC.pdf)

------
kmitz
Very good overview, thanks a lot for this article.

------
jokoon
Hope this makes some people want to learn some optimization basics.

