Modern Microprocessors: A 90 Minute Guide (lighterra.com)
226 points by antonios on Feb 4, 2014 | 37 comments

This is a really good overview of modern processor design. I keep noticing that a lot of people don't seem to understand CPU designs past the five-stage pipeline taught in undergrad architecture courses, and this document does a good job of targeting the areas where folks are most likely to have misconceptions.

I do wish they covered trace caches, or their more modern descendants, u-op (micro-op) caches, which are back in modern Intel chips again and cause some interesting performance artifacts. (The old trace caches of the P4 chips are different from the u-op caches of the new architectures, since the trace cache actually encoded branch predictions into the cache line lookup, which was always pretty wild.)

I agree, this is the best overview on the topic I have seen so far. Of course, there is only so much you can squeeze into such a short overview. E.g. you probably don't really get VLIW if you don't know about trace scheduling. Or many cores without knowing about cache coherency. Etc. That's where you need to go on and read the references at the end.

I wouldn't have minded if they cut out VLIW. That debate is mostly academic at this point, and there are good reasons none of those designs have ever really been successful. If you're writing a guide to how modern chips actually work, it seems unnecessary to include a mostly dead line of work, but I've been hating on VLIW for a long time, so that's easy for me to say.

Yes, VLIW is great for DSPs but not very much else. When the guide was written in 2001, Transmeta and Itanic were still around. So it made sense to include it back then.

Itanium has been outselling SPARC and POWER for most of its life. It may have failed at Intel's original job (deprecate the marginally open x86 space in favour of an Itanium closed shop), but it's been a commercial success against its main competitors.

It's possible that Itanium outsold POWER, but only if you ignore PowerPC, which was its most successful commercialization. As an architecture it existed not only in servers but in PowerMacs, the Xbox 360, PS3, GameCube, Wii, and in embedded applications.

Itanium was a piker compared to POWER.

GPUs are somewhat VLIW aren't they?

As far as I know, they're SIMD, not VLIW. In VLIW, each ALU can be executing a different opcode. In SIMD, each ALU is executing the same opcode, but with a different operand.

The ATI cards were VLIW, which gave them advantages in the fixed pipeline, but as more and more programmable shaders turned up, ATI moved more towards CISC afaik.

GPUs are actually "single program, multiple data" (SPMD) machines, as far as programming is concerned. Internally, I believe they're implemented in a SIMD-like fashion, with extra hardware to handle the single program aspect within each lane.

DSPs are still a hugely important and evolving part of the computer architecture ecosystem, not to mention an important part of the market.

I'd be willing to bet your cellphone has 4-6 DSP cores in it built by 2-3 different companies.

*cough* take a look at the new Mill CPU arch http://ootbcomp.com/docs/ ;)

What reading do you recommend for someone who wants to learn about basic CPU design? I'm not a CS major, but I'm interested.

Computer Systems: A Programmer's Perspective, Bryant & O'Hallaron. http://www.amazon.com/dp/0136108040/ http://csapp.cs.cmu.edu/public/pieces/preface.pdf

This is a unique blend of operating systems and hardware architecture, emphasising application programming over the system implementation approach in Hennessy & Patterson.

A great book, whose only downside is using AT&T syntax.

The old standby, Computer Architecture: A Quantitative Approach, works you up through the basics.

Past the Patterson/Hennessy books, Shen/Lipasti's "Modern Processor Design" was really helpful to get a better sense of the implementation details of real chips, especially when you move past the 'conceptual' views (e.g. Tomasulo) and start to ask how instruction schedulers or renaming or load/store buffers actually work.

Hold on, don't you mean Computer Organization and Design, by the same authors?

They're different books. "Computer Architecture: A Quantitative Approach" is an introduction to the subject for people who will work in the area. "Computer Organization and Design" is for people who need to understand how processors and hardware systems work in order to do their own work. (Mostly.)

The preface to "Organization and Design" says basically this. For what it's worth, "Computer Architecture" is sitting on my shelf, and that's what I used in grad school. But based on their preface, I may buy "Organization and Design" because it may be a better reference for what I do day-to-day.

Not as rigorous as an academic textbook, and unfortunately ends with Core 2, but "Inside the Machine" by Jon Stokes (Hannibal of Ars Technica) is a good overview.

This article is spectacularly good. I wish I had this available when I started doing assembly-level optimizations on x86 chips. This knowledge used to be much more fragmented and difficult to learn.

VLIW could indeed be left out: you are not likely to encounter a VLIW chip, and if you do, it will come with an (excellent) compiler that will do most of the hard work for you.

A good followup article would be a tutorial on how to lay out your structures/arrays in memory given your access patterns and cache architecture.

In my experience, when you encounter VLIW cores there is almost no tooling for them, as they tend to be application-specific DSP-ish cores. In that case hand-optimized assembly is the way to go, since there is no budget to produce an optimizing compiler or anything else.

My experience was with Texas Instruments C6000 DSPs. The compiler is excellent and if you use it well, you rarely have to resort to assembly. Even then, you normally write linear assembly, not parallel assembly, letting the assembler take care of the rest.

An absolutely excellent article!

For everyone interested in the topic, you might enjoy the new Mill CPU architecture talks http://ootbcomp.com/docs/ - the very next talk is streamed live today (5th Feb, 16:15 PST http://ootbcomp.com/topic/instruction-execution-on-the-mill-... )

(I am a Mill forum mod; ask me anything about the Mill ;)

What market is the mill aiming for? Servers, phones, desktop or all of the above?

All of the above! It's a family of processors, all compatible and running the same binaries, but each with the belt sizes, vector sizes, functional units and so on that suit it.

So you'll get smaller Mills where that makes sense and absolute monsters in supercomputers, for example.

When you think about the "smaller" Mill that you might have in your phone or tablet, though, it's a monster compared to today's desktops! Except in the power efficiency department, that is ;)

Have they indicated how they plan to get adoption?

The hackaday interview talks through some of the options on the business side: http://hackaday.com/2013/11/18/interview-new-mill-cpu-archit...

Probably a smaller market segment than any of those, since network effects are important. Maybe render farms or networking or top end embedded or such.

Very nicely written article.

So does ANSI/ISO C actually expose those details any more than other languages do, as many claim?

No, and generally even the binary itself doesn't expose any of that, since you want the same program to run on your in-order atom and your OoO, multithreaded, Core iFoo.

The big exceptions are SIMD and VLIW, two similar forms of explicit parallelism. They more or less require programming-language support to use fully, and most older languages are purely scalar, though some can be used in a manner that is fairly amenable to automatic vectorization.

At the ISA level you're right, those do require explicit instructions (or bundling of instructions). At the programming language level no.

VLIWs can be programmed by normal compilers with well-understood techniques (though variable memory latency makes static scheduling hard to do well in practice for most workloads). You just throw your C code at the compiler and it spits out valid binaries; this has worked for as long as there have been VLIW machines.

The techniques for auto-parallelizing 'for' loops by compilers into SIMD instructions are a more recent development, but they certainly exist. The Intel C Compiler is particularly good at that, but Clang and GCC can do this too.

I did mention automatic vectorization. I know it exists, but it is not perfect even in Intel's implementation. Compilers simply cannot be relied upon to always find and take advantage of opportunities for parallelism in serial code. Having language constructs that explicitly express that parallelism helps ensure that the compiler at least tries to generate parallel code, where given straight C it might just decide that the loop body is too big to bother with or that the side effects are too complicated to prove it can be parallelized.

Here's another good intro to CPU design, floating-point math, linear algebra, PDE solvers, etc.: http://www.tacc.utexas.edu/~eijkhout/Articles/EijkhoutIntroT...

Very good overview, thanks a lot for this article. Hope it makes some people want to learn about some optimization basics.
