I do wish they covered trace caches, or their more modern descendants, u-op (micro-op) caches, which are back in modern Intel chips and cause some interesting performance artifacts. (The old trace caches of the P4 chips are different from the u-op caches of the newer architectures, since the trace cache actually encoded branch predictions into the cache line lookup itself, which was always pretty wild.)
Itanium was a piker compared to POWER.
I'd be willing to bet your cellphone has 4-6 DSP cores in it built by 2-3 different companies.
This is a unique blend of operating systems and hardware architecture, emphasising application programming over the system implementation approach in Hennessy & Patterson.
The preface to "Organization and Design" says basically this. For what it's worth, "Computer Architecture" is sitting on my shelf, and that's what I used in grad school. But based on their preface, I may buy "Organization and Design" because it may be a better reference for what I do day-to-day.
VLIW could indeed be left out: you are not likely to encounter a VLIW chip, and if you do, it will come with an (excellent) compiler that will do most of the hard work for you.
A good followup article would be a tutorial on how to lay out your structures/arrays in memory given your access patterns and cache architecture.
For everyone interested in the topic, you might enjoy the new Mill CPU architecture talks http://ootbcomp.com/docs/ - the very next talk is streamed live today (5th Feb, 16:15 PST http://ootbcomp.com/topic/instruction-execution-on-the-mill-... )
(I am a Mill forum mod; ask me anything about the Mill ;)
So you'll get smaller Mills where that makes sense and absolute monsters in supercomputers, for example.
When you think about the "smaller" Mill that you might have in your phone and tablet, though, it's a monster compared to today's desktops! Except in the power efficiency department, that is ;)
So how does ANSI/ISO C expose those details vs other languages, as many claim to?
VLIWs can be programmed by normal compilers with well-understood techniques (though variable memory latency makes static scheduling hard to do well in practice for most workloads). You just throw your C code at the compiler and it spits out valid binaries; this has worked for as long as there have been VLIW machines.
The techniques for auto-parallelizing 'for' loops by compilers into SIMD instructions are a more recent development, but they certainly exist. The Intel C Compiler is particularly good at that, but Clang and GCC can do this too.