A Historical Look at the VAX: Microprocessor Economics (2006) (realworldtech.com)
52 points by lordgrenville on June 17, 2021 | 18 comments



Someone here on HN recommended the excellent Computer Engineering: A DEC View of Hardware Systems Design[1], which covers much of the practical engineering & economics side of the VAX and its evolution through contemporary essays.

Reading between the lines, it was clear that DEC never managed to escape the gravity well of being born as a "module company", even as Moore's Law was inevitably pulling all modules of competitive relevance into the microprocessor itself.

[1] http://www.bitsavers.org/pdf/dec/_Books/Bell-ComputerEnginee...


Fascinating book. As far as I can tell these modules seem to be expansion cards, perhaps in external boxes? The book tries to explain them but I don't think it does a great job for those not familiar with big iron of the 70s.


> Some ISAs are more suitable for aggressive implementations, and some make it harder. The canonical early comparison was the CDC 6600 versus the IBM 360/91; the even stronger later one would be Alpha versus VAX. A widespread current belief is that the complexity, die cost, and propensity for long wires of high-end OOOs may have reached diminishing returns, compared to multi-core designs with simpler cores, where the high-speed signals can be kept in compact blocks on-chip.

Interestingly, this is also true within RISC implementations. One of the original goals of RISC-V was a simple, clean-slate design that would also avoid the pitfalls in existing ISAs that make 'aggressive' implementations hard.


I'm not sure that they succeeded: 32-bit instructions aligned on 16-bit boundaries make decoding harder, and I wonder if we're ever going to see an 8-wide decoder for RISC-V. And I remember seeing a comparison here between ARM's SVE and the RISC-V vector extension; I wasn't very convinced by the latter.


> I wonder if we're ever going to see an 8-wide decoder for RISC-V.

Considering there are 6-wide x86 cores on the market, I wouldn't be too concerned about RISC-V going wide. The instruction-length decode is a 2-bit dependency chain, whereas the renamer's dependency logic builds off 5-bit register values, which will be the bigger bottleneck.

Note that RISC-V's compressed instructions are trivially identified from the first 2 bits; there is none of the wonkiness seen in other ISAs where you need to decode most of an instruction before you can know its true length.
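
For the curious, the length rule fits in a few lines. A minimal C sketch, assuming only the standard 16-bit (compressed) and 32-bit formats and ignoring the reserved longer encodings:

  #include <stdint.h>

  /* RISC-V instruction length from the low 2 bits of the first halfword.
     Bits [1:0] != 0b11 mean a 16-bit compressed instruction; 0b11 means
     a 32-bit instruction. Reserved >=48-bit formats are not handled. */
  static int rv_insn_length_bytes(uint16_t first_halfword) {
      return (first_halfword & 0x3) == 0x3 ? 4 : 2;
  }

Because each 16-bit parcel tells you on its own whether it starts a 16- or 32-bit instruction, a wide front end can find instruction boundaries with a short ripple of these checks instead of fully decoding each instruction first.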


It'll be interesting to see how feature support vs the ISA extension system plays out for RISC-V. IIRC, the main reason 32-bit instructions can sit on 16-bit boundaries is the compressed extension, which got stuffed into the "general" extension family. Having worked on a (simulated) RISC-V decoder myself, the compressed instructions stuck out like a sore thumb; in fact I think some microarchitectures literally pawn this extension off to its own functional unit entirely.

It would probably be tedious—but not impossible—to write a tool which literally replaces the compressed instructions with their full length variants in an ELF (and then recalculates any offsets as needed).

I'd be very curious to see the performance tradeoffs between BOOM and some sort of big/little RISC-V microarchitecture on benchmarks compiled with and without compressed instruction support.


> which got stuffed into the "general" extension family

No, it didn't. That's why Linux distributions say they are RV64GC, for general + compressed.

The original encoding was designed by Andrew Waterman to be implementable in a minimal number of gates, and he was successful.

So of course any change would stick out like a sore thumb. That, however, does not mean it's bad. For cores targeting reasonable performance it is well worth it, because you get many benefits, not just code size. There is a reason that all systems that expect reasonable performance have adopted the C extension as standard; decoding is a tiny part of the chip at that point.

Even some very tiny cores prefer it, as slightly more decoder complexity is a price worth paying for the code-size savings.

I suggest you go look through the lectures on the RISC-V channel; I have seen a number of presentations that make such comparisons, if I remember correctly.


> 32-bit instructions aligned on 16-bit boundaries make decoding harder

That's an optional extension, and the gains in code density make it worthwhile in many cases. It's certainly a lot better designed than many other variable-length ISAs.


Here's the real problem: DEC was doing hand-designed processors to get a 20% improvement when everybody else could get 100% by waiting 18 months and riding Moore's Law.

What made this possible? Chemical-mechanical polishing.

Why? Because you suddenly had more than 2 layers of metal.

Synthesis systems suck rotten eggs when they don't have lots of routing resources.

Once you have 5+ layers of metal, standard-cell synthesis systems are a NoBrainer(tm). You make alternate layers orthogonal and put power and ground on the topmost level.

At that point, anyone not riding the synthesis train was roadkill.


Ironic. CMP was developed and perfected at IBM, who used it to make complex low volume products.

Excellent insight. CMP and damascene metal (guess who!) really kicked CMOS into top gear.


> there is a tradeoff between committing the capital to own a fab and achieving higher clock rates, or preserving your capital, not owning the fab, and losing the ability to produce highly tuned designs. To give a rough notion of the trade-offs, in the same process size, the speed of a process (measured in FO4 latency) can vary by a factor of 3, between the best processes at foundries like TSMC or UMC, and fabbed manufacturers like Intel or IBM.

Is this still valid? How does this play out, say for Apple's M1?


I'm only adjacent to that part of the industry. From what I hear, process still matters, _but_ the situation has changed in a couple of interrelated ways:

- The number of players with top-tier fabs has shrunk to three (TSMC, Samsung, and Intel); everything above experimental volume at <12nm feature sizes is coming off one of their lines. TSMC is all foundry-for-hire, Samsung does some of their own designs and some for-hire, and Intel is mostly their own products but is currently expanding their foundry business. Strictly, Intel is a little behind at the moment because a couple of the process bets they made have proven difficult to get ready for volume production, but they're still ahead of everyone else.

Getting onto the hottest, newest process is still an in-house thing for Intel and Samsung, but for everyone else (Apple, Nvidia, etc.) it's basically a question of who offers TSMC the most lucrative deal.

AMD and IBM both divested their fabs (and hence most of the world's SOI process capacity since they were the two players heavily invested in that) into GlobalFoundries, who have since dumped most of IBM's old fabs to OnSemi. They aren't doing anything strictly cutting edge.

- The era this article was written in was right at the end of truly holistic, transistor-level hand-tuned designs. The scale of modern chips forces a higher level of abstraction, more reliant on standard-ish cells and automated design tools (provided at large cost by the CAD vendor and/or fab owner to match the process being targeted), with some local optimization of performance-critical parts. That somewhat levels the playing field, because everyone gets some access to fab-specific details through licensed cell libraries, pre-tuned IPs, process consultants from the fab owners and companies like Cadence, etc., and no one is really hand-tuning big logic parts.

- The hardware IP market has exploded. For example, among ARM processors, the magic in "Apple Silicon," other than paying for priority access to TSMC's 5nm capacity, is that Apple (and about 10 other companies) design their own hand-tuned ARM cores under an architectural license and add their own special sauce, while most players license "off-the-shelf" cores (the "Cortex" brand) from ARM Ltd. and integrate them.

- I'm only talking about microprocessor-relevant fabs here; there are a handful of bleeding-edge hot shit fabs for processes suited to other things, like the fancy GaN fabs for high-power/high-frequency switching applications.


There's probably still -some- validity to it, but I think many lines are blurred.

For example, Intel's '10nm' is actually pretty close to TSMC's '7nm' as far as transistor density goes, and against other processes labeled '10nm' it is clearly superior from a density standpoint [0], [1].

However, Intel has had trouble with their own '7nm' node, while TSMC is now at '5nm'.

[0] - https://en.wikipedia.org/wiki/10_nm_process#10_nm_process_no... [1] - https://en.wikipedia.org/wiki/7_nm_process#7_nm_process_node...


We had a VAX at U. Hartford as of 2005, not sure if it’s still running. By the time I got to college I already had several years of Unix experience and was confident I could use its command line. I found it very strange and tricky to use. One thing that stood out was that I could create multiple files with the same name and path, which felt very strange. I wish I had time to explore it further.


The versioning file system is awesome. It made rolling backups for you, and you could configure the number of them per directory. So a listing of test.txt would be something like (going from memory):

  test.txt;1   date/time   file size
  test.txt;2   date/time   file size
  test.txt;3   date/time   file size
  test.txt;4   date/time   file size
  test.bak     date/time   file size
This offered an intuitive way to undo mistakes. You could always include the version number as part of the file name.
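
Also from memory, so the exact syntax here is only approximate: the per-directory/per-file limits and cleanup were handled with DCL commands along these lines (MYDIR is just a placeholder name)

  $ SET DIRECTORY /VERSION_LIMIT=4 [MYDIR]   ! cap versions for new files in MYDIR
  $ SET FILE /VERSION_LIMIT=4 TEST.TXT       ! or cap versions on one file
  $ PURGE /KEEP=2 TEST.TXT                   ! delete all but the 2 newest versions
  $ TYPE TEST.TXT;2                          ! refer to an older version explicitly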


Interestingly, the lead architect of the VMS operating system, Dave Cutler, was headhunted by Microsoft in the 80s to work on their new operating system, Windows NT.

There are a LOT of similarities between VMS and NT, under the hood. From memory, the process priority and scheduling systems are (or originally were) identical.


This article only analyzed the economics, but the original Usenet thread by the same subject-matter expert, CPU architect John Mashey, also analyzed the technical merits of x86, VAX and Alpha. It's an extremely long thread, but an in-depth analysis; reading it is strongly recommended: https://yarchive.net/comp/vax.html

Actually, John Mashey said he would write about the technical aspects in Part III on RealWorldTech, but Part III was never completed.

> In Part III, I’ll sketch some of the tough issues for implementing the VAX, as best I can. In particular, I will note the ISA features that might make things harder for VAX than for x86, to do 2 or 4-issue superscalar, or a full blown out-of-order design. In particular, what this means is that you can implement a type of microarchitecture, but it gains you more or less performance dependent on the ISA and the rest of the microarchitecture. For instance, the NVAX design at one point was going to decode 2 specifiers per cycle, but it was found to add too much complexity and only get a 2% performance increase.

---

Interestingly, the relatively elegant and orthogonal CISC design of the VAX makes implementing an out-of-order, pipelined CPU more difficult than for x86, a far less orthogonal CISC. There were real technical problems. Not to say they couldn't have been overcome, but the combination of technical and economic factors made a competitive VAX CPU unfeasible.

Here are some highlights, please read the original link for the complete thread.

---

> Well, a few years later Dileep Bhandarkar, then employed at Intel, wrote a paper where he claimed (IIRC) that the performance advantage of RISCs had gone (which I did not take very seriously at the time); unfortunately I don't know which of his papers that is; I just looked at "RISC versus CISC: a tale of two chips" and it looks more balanced than what I remember.

That was another fine paper from Dileep, but the conclusion "X86 can be made competitive with RISC" is not the same as "VAX can be made competitive with RISC".

> >BOTTOM LINE:

> >

> >DEC had every motivation in the world to keep extending the VAX as long as possible, as it was a huge cash cow. DEC had plenty of money, numerous excellent designers, long experience in implementing VAXen. BUT IT STOPPED BEING POSSIBLE TO DESIGN COMPETITIVE VAXen...

> Looking at what Intel and AMD did with the 386 architecture, I am convinced that it is technically possible to design competitive VAXen; I don't see any additional challenges that the VAX poses over the 386 that cannot be addressed with known techniques; out-of-order execution of micro-instructions with in-order commit seems to solve most of the problems that the VAX poses, and the decoding could be addressed either with pre-decode bits (as used in various 386 implementations), or with a trace cache as in the Pentium 4.

You're entitled to your opinion, which was shared by the VAX9000 implementors.

Many important senior VAX implementors disagreed. I've posted some of the reasons why VAX was harder than X86, years ago. Of course you can do these things, but different ISAs get different mileage from the same techniques.

> Of course, on the political level it stopped being possible to design competitive VAXen, because DEC had decided to switch to Alpha, and thus would not finance such an effort, and of course nobody else would, either.

Ken Olsen loved the VAX and would have kept it forever. Key salespeople told him it was getting uncompetitive, and engineers told him they couldn't fix that problem, and they'd better start doing something else.

FUNDAMENTAL PROBLEM: Certain VAX ISA features complexify high-performance parallel implementations, compared not only to high-performance RISCs, but also to IA-32.

The key issue is highlighted by Hennessy & Patterson [1, E-21]: "The VAX is so tied to microcode we predict it will be impossible to build the full VAX instruction set without microcode."

Unsaid, presumably because it was taken for granted, is:

For any higher-performance, more parallel micro-architecture, designers try to reduce the need for microcode (ideally to zero!). Some kinds of microcoded instructions make it very difficult to decouple:

A) Instruction fetch, decode, and branching

B) Memory accesses

C) Integer, FP, and other operations that act on registers

Instead, they tend to make A&B, or A&C, or A,B&C have to run more in lockstep.

It is hard to achieve much Instruction Level Parallelism (ILP) in a simple microcoded implementation, so in fact, implementations have evolved to do more prefetch, sometimes predecode, branch prediction, in-order superscalar issue with multiple function units, decoupled memory accesses, etc, etc. ISAs often had simple microcoded implementations [360/30, VAX-11/780, Intel 8086] and then evolved to allow more pipelining. Current OOO CPUs go all-out to decouple A), B), and C), to improve ILP actually achieved, at the expense of complex designs, die space, and power usage.

Some ISAs are more suitable for aggressive implementations, and some make it harder. The canonical early comparison was the CDC 6600 versus the IBM 360/91; the even stronger later one would be Alpha versus VAX. A widespread current belief is that the complexity, die cost, and propensity for long wires of high-end OOOs may have reached diminishing returns, compared to multi-core designs with simpler cores, where the high-speed signals can be kept in compact blocks on-chip.

IA-32 has baroque, inelegant instruction encoding, but once decoded, most frequently-used instructions can be converted to a small number (typically 1-4) of micro-ops that are RISC-like in their semantic complexity, and certainly don't need typical microcode. As noted earlier in this sequence, the IA-32 volumes can pay for heroic design efforts.
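
(A rough sketch of my own to make the micro-op point concrete, not from the original post, and not any vendor's actual micro-op format.) A memory-destination x86 add splits naturally into a load, a register ALU op, and a store, expressed here in C:

  #include <stdint.h>
  #include <string.h>

  /* Toy flat memory so the sketch is self-contained. */
  static uint8_t mem[1 << 16];
  static uint32_t load32(uint32_t a)              { uint32_t v; memcpy(&v, &mem[a], 4); return v; }
  static void     store32(uint32_t a, uint32_t v) { memcpy(&mem[a], &v, 4); }

  /* `add dword [ebx+8], eax` as three RISC-like steps. */
  static void add_mem_reg(uint32_t ebx, uint32_t eax) {
      uint32_t tmp = load32(ebx + 8);  /* address generation + load */
      tmp += eax;                      /* ALU op on registers       */
      store32(ebx + 8, tmp);           /* store back to memory      */
  }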

The VAX ISA is orthogonal, general, elegant, and easier to understand, but that generality also makes decoding difficult when trying to handle several operand specifiers in parallel. Worse, numerous cases are possible that tend to lockstep together 2 or 3 of A), B), or C), lowering ILP, or requiring hardware designs that tend to slow clock rate or create difficult chip layouts. Even worse, a few of the cases are common in some or many workloads, not just potential.
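
(Again a simplified sketch of my own, not from the original post.) The decoding problem in a nutshell: every operand specifier starts with a mode byte whose value determines how long that specifier is, so you cannot even find where specifier N+1 begins until you have examined specifier N, which serializes exactly the work a wide decoder wants to do in parallel:

  #include <stddef.h>
  #include <stdint.h>

  /* Byte length of one VAX operand specifier (simplified: 'opsize' is the
     operand size in bytes implied by the opcode; index mode prefixes
     another specifier and is handled recursively). */
  static size_t vax_specifier_len(const uint8_t *p, size_t opsize) {
      uint8_t mode = p[0] >> 4, reg = p[0] & 0xF;
      switch (mode) {
      case 0: case 1:
      case 2: case 3:   return 1;                                  /* short literal      */
      case 4:           return 1 + vax_specifier_len(p + 1, opsize); /* index prefix     */
      case 5: case 6:
      case 7:           return 1;                                  /* Rn, (Rn), -(Rn)    */
      case 8:           return reg == 15 ? 1 + opsize : 1;         /* (Rn)+ or immediate */
      case 9:           return reg == 15 ? 1 + 4 : 1;              /* @(Rn)+ or absolute */
      case 10: case 11: return 2;                                  /* byte displacement  */
      case 12: case 13: return 3;                                  /* word displacement  */
      default:          return 5;                                  /* long displacement  */
      }
  }

  /* Specifier N+1's position depends on specifier N's length, so this
     walk over an instruction's operands is inherently sequential. */
  static size_t vax_operands_len(const uint8_t *p, int nops, size_t opsize) {
      size_t off = 0;
      for (int i = 0; i < nops; i++)
          off += vax_specifier_len(p + off, opsize);
      return off;
  }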

As one VAX implementor wrote me: "it doesn't take much of a percentage of micro-coded instructions to kill the benefits of the micro-ops."

That is a crucial observation, but of course, the people who really know the numbers tend to be the implementers...

It is interesting to note that the same things that made VAX pipelining hard, and inhibited the use of a 2-issue superscalar, also make OOO hard. Some problems are easier to solve, but others just move around and manifest themselves in different ways.

- decode complexity
- indirect addressing
- multiple side-effects
- some very complex instructions
- subroutine call mechanism

Following is a more detailed analysis, showing REFERENCES first (easier to read on the Web), briefly describing OOO, and then going through a sample of troublesome VAX features, comparing them to IA-32 and sometimes S/360, followed by a CONCLUSION that wraps all this together with DEC's CMOS roadmap in the early 1990s to show the difficulty of keeping the VAX competitive.


The conclusion I've come to with CISC, and specifically x86, is that it's basically a better VLIW than VLIW. That's not to say there aren't opportunities with x86's memory model. To me the greatest advantage of RISC is that it triggers my OCD less than x86 does... until I look at how CISCy ARM gets sometimes...



