Ask HN: Why did stack-based CPUs lose out?
122 points by optimalsolver on Dec 31, 2022 | 86 comments
Back in the 80s, they seemed a viable alternative for the future of computing systems. The RTX2010 [0] found wide adoption in the spaceflight industry.

Since then, however, they seem to have completely petered out.

Is there any specific reason?

[0] https://en.wikipedia.org/wiki/RTX2010




Stack-based CPUs didn't lose out to accumulator-based CPUs; the x87 for example was stack-based (and was originally intended to support the hardware stack transparently overflowing to RAM -- but they didn't write the software to support that before the chip shipped).

But both of them lost out to CPUs with many general-purpose registers, because that's what you need in order to write code with many instructions in flight at once (whether superscalar, pipelined, or both). If you look at the FPU on the Pentium you see the nightmare you run into otherwise -- optimized Pentium x87 code is half FXCH (swap the top of stack with something else) instructions, because the FPU was pipelined and so you had to keep shuffling your stack to avoid pipeline stalls. The Pentium FPU has a hefty register-renaming unit to make this work; if it had been designed with general-purpose registers rather than as a stack, that would have been avoided.


In 1993-4 I was an intern at Inmos, when they were trying to get the T9000 transputer to work.

The transputer was a stack architecture, with a bytecode-style instruction set. The stack was shallow, just big enough to evaluate a typical expression you would write in a high-level language. It also had a local addressing mode, relative to a “workspace” pointer register, which was used for local variables.

To make the T9 go faster, they gave it a “workspace cache”, which was effectively a register file. The instruction decoder would collect up sequences of bytecodes and turn them into RISC-style ops that worked directly on the registers, so the stack was in effect JITted away by the CPU’s front end.

A really cool way to revamp an old design; a pity that the T9 was horribly buggy and never reached its performance goals :-(


Not an EE or chip designer but

1. Stack-based arithmetic requires a stack (duh), whereas registers are more general-purpose. Put another way, a stack is specifically for arithmetic whereas registers can be used however you like.

2. More importantly, I suspect a stack implementation puts extra pressure on the register set (edit: NOT the register set, sorry, I meant the locations/hardware used for the stack) - read item 1 and item 2, then push the result back onto where item 2 was. With registers you can put the result into any other register, which allows other instructions to execute rather than stall waiting for the result to appear on the stack.

3. A stack is a bottleneck: with registers you can have multiple ALUs, which allows multiple instructions to happen out of order. Try doing that with a stack (see 2 again, and the toy sketch below).
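To make that concrete, here is a toy Python sketch (purely illustrative, not modelling any real ISA) of the same expression (a+b)*(c+d) in stack form and in register form. In the stack form every op funnels through the top of the one stack; in the register form the two adds share no operands, so hardware is free to overlap them:

    # Toy illustration: (a+b)*(c+d) for a stack machine and a register machine.
    def run_stack(program, env):
        # Every instruction reads and/or writes the top of the single stack.
        stack = []
        for op, *args in program:
            if op == "push":
                stack.append(env[args[0]])
            elif op == "add":
                stack.append(stack.pop() + stack.pop())
            elif op == "mul":
                stack.append(stack.pop() * stack.pop())
        return stack.pop()

    def run_regs(program, env):
        # Each instruction names its own destination; no shared bottleneck.
        regs = dict(env)
        for op, dst, x, y in program:
            regs[dst] = regs[x] + regs[y] if op == "add" else regs[x] * regs[y]
        return regs[program[-1][1]]

    stack_code = [("push", "a"), ("push", "b"), ("add",),
                  ("push", "c"), ("push", "d"), ("add",), ("mul",)]
    register_code = [("add", "r1", "a", "b"),   # these two adds are independent...
                     ("add", "r2", "c", "d"),   # ...only the mul has to wait for both
                     ("mul", "r3", "r1", "r2")]

    env = {"a": 1, "b": 2, "c": 3, "d": 4}
    print(run_stack(stack_code, env), run_regs(register_code, env))  # 21 21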

happy to be corrected


"My history with Forth & stack machines" at https://yosefk.com/blog/my-history-with-forth-stack-machines... hits HN every year or more - https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu... .

The most relevant parts are:

> We thought about making our own CPU. The person with the overall responsibility for the hardware design gently told me that I was out of my mind. CPUs have register files and pipeline and pipeline stalls and dependency detection to avoid those stalls and it's too complicated.

> And then I asked, how about a stack machine? No register file. Just a 3-stage pipeline – fetch, decode, execute. No problem with register dependencies, always pop inputs from the top of the stack, push the result.

> He said it sounded easy enough alright, we could do that. "It's just like my RPN calculator. How would you program it?" "In Forth!" ...

> It was among the most painful programming experiences in my life. ...

> Speaking of which. I started to miss a C compiler. I downloaded LLVM. (7000 files plus a huge precompiled gcc binary. 1 hour to build from scratch. So?) I wrote a working back-end for the stack machine within a week. Generating horrible code. Someone else wrote an optimizing back-end in about two months.

> After a while, the optimizing back-end's code wasn't any worse than my hand-coded Forth.


Since then... was a decade ago (not 30 years)

https://twitter.com/philae2014/status/427842417920712704

https://www.cpushack.com/2014/11/12/here-comes-philae-powere...

> Why was the RTX2010 chosen? Simply put the RTX2010 is the lowest power budget processor available that is radiation hardened, and powerful enough to handle the complex landing procedure. Philae runs on batteries for the first phase of its mission (later it will switch to solar/back up batteries) so the power budget is critical. The RTX2010 is a Forth based stack processor which allows for very efficient coding, again useful for a low power budget.

> Eight of the instruments are also powered by a RTX2010s, making 10 total (running at between 8-10MHz). The lander also includes an Analog Devices ADSP-21020 and a pair of 80C3x microcontrollers as well as multiple FPGAs.

However... more recently:

Insight - https://www.jpl.nasa.gov/news/press_kits/insight/launch/miss...

> The computer's core is a radiation-hardened central processor with PowerPC 750 architecture called RAD 750. This processor operates at 115.5 megahertz speed, compared with 20 megahertz speed of the RAD6000 processor used on Mars Phoenix.

https://www.engadget.com/nasa-perseverance-powerpc-750-17151...

> When NASA's Perseverance guided itself to the surface of Mars on February 18th, it did so with the help of the same processor that powered the 1998 iMac G3. You know, the colorful and bulbous all-in-one desktop that saved Apple's business and helped it become the company it is today. According to the New Scientist (via Gizmodo), NASA's latest rover features an offshoot of the PowerPC 750 CPU. That processor is about 23 years old in 2021.

The answer appears to be "more powerful processors have been designed that are radiation hardened."

That said... while the hardware isn't as common today, a lot of virtual machines are based on a stack architecture - from the JVM to WebAssembly to the Ethereum EVM.


(some rambling questions here that are only semi-related:)

NASA is really anal about radiation concerns, insisting the caches have ECC (or they run it with cache disabled). For the instruction cache this makes sense (as well as any higher caches that would store instructions).

With modern CPUs having so many cores and threads being so cheap, couldn't NASA use these for redundancy?

My second question: how much more resilient are older processes with larger feature sizes compared to modern 6/7nm? Compare modern with older (14nm/22nm FinFET, 28nm/32nm SOI) and ancient (65nm or larger).


> With modern CPU's having so many cores and threads being so cheap, couldn't NASA use these for redundancy?

Power constraints and heat dissipation constraints.

https://arstechnica.com/science/2019/11/space-grade-cpus-how...

> BAE RAD5545 is probably the most powerful radiation-hardened processor available today. Fabricated in the 45nm process, it is a 64-bit quad-core machine clocked at 466MHz with power dissipation of up to 20 Watts—and 20 Watts is a lot. A Quad Core i5 sitting in a 13-inch MacBook Pro 2018 is a 28 Watt processor. It can heat its thin aluminum chassis to really high temperatures up to a point where it becomes an issue for some users. Under more computationally intensive workloads, fans immediately kick in to cool the whole thing down. The only issue is that, in space, fans would do absolutely nothing, because there is no air they could blow onto a hot chip. The only possible way to get heat out of a spacecraft is through radiation, and that takes time. Sure, heat pipes are there to take excessive heat away from the processor, but this heat has to eventually go somewhere. Moreover, some missions have tight energy budgets, and they simply can’t use powerful processors like RAD5545 under such restrictions. That’s why the European GR740 has power dissipation at only 1.5 Watts. It’s not the fastest of the lot, but it is the most efficient. It simply gives you the most computational bang per Watt. The HPSC with 10 Watt power dissipation comes in at a close second, but not always.


> Quad Core i5 sitting in a 13-inch MacBook Pro 2018 is a 28 Watt processor. It can heat its thin aluminum chassis to really high temperatures up to a point where it becomes an issue for some users.

Worth noting here: with Intel's Turbo Boost stuff this processor will draw a LOT more than 28W during normal operation - which makes this an awkward comparison.


45nm is massive for a high-performance processor. Rad-hardening compute is frankly overrated. Have multiple and compare state at a high rate.


But you can't power multiple on the energy budget that you've got.

Perseverance is running on 110 watts ( https://mars.nasa.gov/mars2020/spacecraft/rover/electrical-p... ). The RAD5545 uses 20 watts of that budget. Tripling it means it doesn't have enough energy to power the rest of the rover.

The Mars MAVEN ( https://spaceflight101.com/maven/spacecraft-information/ ) is using a RAD-750 which uses 10 watts of power ( https://en.wikipedia.org/wiki/RAD750 ) (also used in the JWST).

The RAD-750...

> The processor can endure radiation doses that are a million times more extreme than what is considered fatal to humans.

And while you may argue "yes, that seems a bit excessive" - the goal isn't "it can run in those extremes" but rather...

> Also, RAD750 will not suffer more than one event requiring interventions from Earth over a 15-year period.

> “The RAD750 card is designed to accommodate all those single event effects and survive them. The ultimate goal is one upset is allowed in 15 years. An upset means an intervention from Earth — one ‘blue screen of death’ in 15 years. We typically have contracts that (specify) that,” said Vic Scuderi BAE Business Manager.

... because it can't be replaced easily.

From Wikipedia:

> The CPU can withstand an absorbed radiation dose of 2,000 to 10,000 grays


I do not buy the power argument. A modern chip is orders of magnitude more power efficient, especially at low voltage. You could easily have multiples of these within the same power budget.


But smaller feature size increases the chance of an upset event (which might be compensated for), and it also increases the chance of fatal damage.

The radiation in space is not dangerous because it just produces a bit of current that can flip a bit; it is dangerous because it is strong enough to bump atoms out of their position in the silicon crystal, causing permanent damage that accumulates over time. And when your feature size starts to be measured in <100 atoms, it would be easy for a single high-energy event to break a whole core.


i guarantee the people building the spacecraft and choosing the CPUs have thought about this more than you, and have much more knowledge about the problem space than anyone here questioning their decision-making processes.

if it was a good idea to have multiple redundant CPUs on these spacecraft, the spacecraft would have multiple redundant CPUs.


I will note that they do have multiple, redundant CPUs... they just aren't all active at once.

https://www.jpl.nasa.gov/news/press_kits/mars_2020/landing/m...

> Just like Curiosity, Perseverance has two main computers, or “brains”: One is active at any given time, while the other serves as a backup. Both use radiation-hardened RAD750 computers, and together, they’re referred to as Rover Compute Element A and B.

> Another RAD750 serves as a third “brain,” on Perseverance. Called the Vision Compute Element, it is equipped with a special card for analyzing images. This card has a Virtex-5 field programmable gate array (FPGA) and is the driving force behind Terrain-Relative Navigation and the Lander Vision System, which analyzes images of the landing site during descent and compares them to an onboard map to determine the rover’s position relative to the ground.

That "Virtex-5 FPGA" is an interesting one too...

https://www.xilinx.com/products/silicon-devices/fpga/virtex-...

Where it is listed as "Space-grade Virtex-5QV FPGA"


I said I do not buy the power argument and I received two high quality comments explaining issues beyond power consumption. Your appeal to authority adds nothing of value.


The modern chip is more power-efficient because it uses traces that are closer together and thinner.

In the context of radiation hardening by design, this means that a cosmic ray strike could evaporate part of the 5nm trace or cause arcing to the next trace.

A lower voltage chip can be more easily permanently damaged by a cosmic ray. That's why the RAD750 has its radiation resistance measured in grays.

> The gray (symbol: Gy) is the unit of ionizing radiation dose in the International System of Units (SI), defined as the absorption of one joule of radiation energy per kilogram of matter.

This is the total absorbed dose - not the flux.

Additionally, this is for the entire system - not just the CPU. It's not just "here is a redundant CPU" but also "here is the entire system that the CPU is part of" that is redundant.

The JWST is using a RAD750 with 44 MB of RAM... and yea, that's not a lot; we're not trying to ship supercomputers out there. The alternative you're suggesting is putting multiples of those out in space and checking them against each other constantly while the system is in use collecting data.

The power budget is slim and also coupled to the heat dissipation budget. The JWST has a tighter heat dissipation budget than power budget... and the Perseverance has a tighter power budget (total rover power budget is 110 watts).

The challenge I'd pose is "what other processor arrangement fits within its power budget?"

I'm also going to point out that I got the numbers wrong for Perseverance's power budget for its CPU. The single board is 10 watts, the CPU is 5 watts.

https://en.wikipedia.org/wiki/RAD750

> The CPU can withstand an absorbed radiation dose of 2,000 to 10,000 grays (200,000 to 1,000,000 rads), temperatures between −55 °C and 125 °C, and requires 5 watts of power. The standard RAD750 single-board system (CPU and motherboard) can withstand 1,000 grays (100,000 rads), temperatures between −55 °C and 70 °C, and requires 10 watts of power.

https://mars.nasa.gov/mars2020/spacecraft/rover/brains/

> Unlike people and most animals, the rover's brains - its computer - are in its boxy body. The computer module is called the Rover Compute Element (RCE) - there are actually two identical RCEs in the body so there is always a spare "brain."

Also - https://space.stackexchange.com/questions/50470/what-makes-i...

These are modern chips that are designed for that efficiency - it's just that there are other constraints too. You are unlikely to find a lower wattage processor that you can run with redundancy that is able to tolerate space travel.

https://link.springer.com/article/10.1007/s12567-016-0138-0

Radiation Effects and COTS Parts in SmallSats https://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=2...

Note that the cubesat has an average lifespan of about a year.


The power argument is interesting. Triple-redundant 5W processors and 5W more for the redundancy support circuits (which must not fail)? It sounds cheaper and easier to just slap on that 20W part, unless that 20W part is so expensive that it outweighs the engineering man-hours for the 5W x 3 solution, but I can't see why it would unless said engineers are massively underpaid.


I'm going to correct a previous part - the 20W appears to be the budget for the system.

The RAD750 is a 5W processor and on a motherboard, is a 10W system.

https://en.wikipedia.org/wiki/RAD750

> The CPU can withstand an absorbed radiation dose of 2,000 to 10,000 grays (200,000 to 1,000,000 rads), temperatures between −55 °C and 125 °C, and requires 5 watts of power. The standard RAD750 single-board system (CPU and motherboard) can withstand 1,000 grays (100,000 rads), temperatures between −55 °C and 70 °C, and requires 10 watts of power.


You can have one disruptive rad event every 10 minutes if you compare and correct state between three or five computers at something like 100 Hz.

And a more recent computer that isn't designed for rad environments will automatically be way more efficient than a rad-hardened one.
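To be concrete about what I mean by compare-and-correct, here's a toy Python sketch of one majority-vote step across three redundant copies (the state layout and numbers are invented, not from any real flight system):

    from collections import Counter

    def vote_and_scrub(states):
        # Toy triple-modular-redundancy step: pick the majority value and
        # overwrite any outlier with it.
        majority, count = Counter(states).most_common(1)[0]
        if count < 2:
            raise RuntimeError("no majority - more than one copy corrupted")
        return majority, [majority] * len(states)

    # Pretend each "state" is a single word of memory; one copy took a bit flip.
    states = [0xDEADBEEF, 0xDEADBEEF, 0xDEADBEFF]
    good, corrected = vote_and_scrub(states)
    print(hex(good), [hex(s) for s in corrected])

A real system would have to do this over all of the state it cares about, at some fixed scrub rate, not just one word.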


Yes, but you have to compare the complete state of each computer for this to work, including every bit of memory & cache. You can't have much RAM if you're planning to compare them in a few milliseconds.

If you try to do this without comparing the whole state, and only comparing the outputs of some control algorithm between the computers, one of them could have some internal state corrupted and remain unnoticed for a long time, long enough for other corruptions to happen on the other computers, and then you have no straightforward way to figure out which one to trust and which to reset.


What set of processors would you recommend putting in Perseverance with a 110 watt power budget? Or on JWST with its heat dissipation requirements?


> couldn't NASA use these for redundancy?

> how much more resilient are older processes with larger feature size compared to modern 6/7nm ?

If I may paraphrase your questions, I think you are trying to ask "isn't rad hardening a bit overrated? are space chips that good?" - and a Wikipedia edit[1] that has kept most of its original form for more than 15 years has been answering that question thanklessly:

> Hardened chips are often manufactured on insulating substrates instead of the usual semiconductor wafers. Silicon on insulator (SOI) and silicon on sapphire (SOS) are commonly used. While vanilla chips can withstand between 5 and 10 krad, SOI and SOS can survive doses many orders of magnitude greater.

So the answer is: if 10 krad (over the lifetime) and [unspecified] events/second is enough for your requirements, maybe you don't need rad-hard chips. If your requirements are going to be orders of magnitude greater (you can do the math from the data), then you might want them - and yes, the differences are orders of magnitude.

1: https://en.wikipedia.org/w/index.php?title=Radiation_hardeni...


>With modern CPU's having so many cores and threads being so cheap, couldn't NASA use these for redundancy?

NASA needs redundancy. Say we ran a program in triplicate, one on each core, and compared the outputs. We still don't have redundancy here because the chip can still fail in a way that will cause problems for all three cores.

>My second question, how much more resilient are older processes with larger feature size compared to modern 6/7nm ? Compare modern with older (14nm/22nm finfet, 28nm/32 soi) and ancient (65nm or larger).

Larger gates tend to be more resilient to radiation. I don't know exactly why. The fabrication tech is also important. Some types of chip manufacturing process result in more resilient chips. I'm sorry but I don't know enough about it to elaborate further.


Not an expert, but essentially radiation can produce electrons in random places, flipping one or more transistors unexpectedly.

Larger transistors need more electrons so they are slightly more resistant.


The other problem is that it can bump atoms permanently off the silicon lattice, causing permanent damage. Bigger transistors can just survive more hits.


That's a single event upset (SEU). The other radiation issue is total dose. That's when charge gradually builds up in insulating gates due to ionizing radiation. It's permanent, and eventually reaches the point that the transistor is always on, so it has a permanent fault. One of the factors is collection area, so larger gates are not less susceptible to this - they have a larger collection area. But if they run at a higher voltage, that can make them more tolerant to total dose.


Because larger transistors need more electrons, they also need more power to switch. I guess that circuitry made from valves/vacuum tubes might provide high radiation resistance?


I mean, you could just use big transistors


> With modern CPU's having so many cores and threads being so cheap, couldn't NASA use these for redundancy?

Bold No.

CMOS technology has one drawback - it is prone to latch-up from ionization, because it has a parasitic p-n-p-n thyristor (it gets locked in the wrong state, and a full power-off is needed to reset it).

Rad-hardened designs have to deal with this; they use special die materials and clamping diodes, so their technology is essentially different from civilian designs.


> With modern CPU's having so many cores and threads being so cheap, couldn't NASA use these for redundancy?

Maybe? But CPU cost basically doesn't matter compared to launch cost. Power budget given the required computing needs during critical sections of the mission is probably the biggest priority after reliability. Modern CPUs are good at compute/watt, but not necessarily at low watts at low load.

No idea re: feature size. For cosmic rays and DRAM, it was thought that shrinking beyond some point was going to increase failures dramatically, but then it was fine. But in space, radiation is more than cosmic rays.


> With modern CPU's having so many cores and threads being so cheap, couldn't NASA use these for redundancy?

I think the Shuttle had 3 separate computers running the same code, with a mechanism to vote out the one giving bad output?

There are also some automotive ARMs that are basically 2 cores, physically rotated 90 degrees from each other, running in lockstep, with some logic to reset if they disagree.

> My second question, how much more resilient are older processes with larger feature size compared to modern 6/7nm ? Compare modern with older (14nm/22nm finfet, 28nm/32 soi) and ancient (65nm or larger).

I'd imagine by a lot. Think of the charged particle as basically a pulse of charge. The bigger the transistors are, the bigger the capacitances are and the more charge you need to flip anything. Any potential damage it could cause to the chip would also increase. I didn't find anyone trying to make the "same" chip in significantly different processes.


The HotSpot VM only uses a stack for storing local variables when executing bytecode. As soon as it needs to optimize a hot path, it turns the bytecode into register-based, CPU-specific assembly.


Very true... but it's still a virtual stack machine.

This gets into the fact that, when you write the implementation or JIT, it is much easier to write something that doesn't use any registers and then target it at a machine that has some number of registers, than to write a VM that is register-based and then map it onto hardware with different conventions for how those registers are used...

This is something that had to work on a 386, Sun SPARC, MIPS, Motorola 68030, Alpha, VAX (the CISCest instruction set you'll ever find https://documentation.help/VAX11/op_POLY.htm )... and even its own Java processor ( https://en.wikipedia.org/wiki/Java_processor - https://www.usenix.org/legacy/publications/library/proceedin... ).

The difficulty would be if you made a... 32-register system and then you had to do the JIT from that to MIPS, but the frame pointer was foreign to the VM and $s0-$s7 were not ones you could touch without extra work... and then x86 had different conventions based on who wrote the compiler ( https://en.wikipedia.org/wiki/X86_calling_conventions#List_o... )...

It becomes much easier to work with a system with no registers and then make it "easier" to do the JIT than to do registers and then go from "too many" to trying to redo them for that system.

And while it's not something I follow at all, apparently the EVM is also stack based. https://medium.com/mycrypto/the-ethereum-virtual-machine-how...

> The EVM uses a 256-bit register stack from which the most recent 16 items can be accessed or manipulated at once. In total, the stack can only hold 1024 items.

Which is... kind of interesting. Reminds me somewhat of dc - https://en.wikipedia.org/wiki/Dc_(computer_program) - which is stack-based but has "registers" too (though you tend to use the registers for storing macros, which can then be loaded onto the stack and executed).


Because unless you want to write everything in Forth, register-based architectures are faster and easier to generate optimized code for.


Then why are stack based VMs so much more common than register based ones?

A bit of an appeal to authority here I know, but Eric Lippert felt stack machines were easier to generate code for:

A stack machine is a very simple way to describe a complex computation; by being able to write code for such a simple machine, it lowers the cost of making a compiler. And not only is it easier to write compilers and jitters that target simple stack machines, it is easier to write other code analysis tools as well.

https://learn.microsoft.com/en-us/archive/blogs/ericlippert/...
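Part of why it's cheap: emitting stack code is just a post-order walk of the expression tree, with no register allocation step. A toy sketch in Python (the opcode names are invented):

    # Toy compiler: expression tree -> stack-machine code, by post-order walk.
    def compile_expr(node, out):
        if isinstance(node, (int, float)):
            out.append(("PUSH", node))
        else:
            op, left, right = node
            compile_expr(left, out)   # operands land on the stack...
            compile_expr(right, out)
            out.append((op,))         # ...and the operator just consumes them

    code = []
    compile_expr(("MUL", ("ADD", 2, 3), 4), code)   # (2 + 3) * 4
    print(code)
    # [('PUSH', 2), ('PUSH', 3), ('ADD',), ('PUSH', 4), ('MUL',)]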


If you aim to ship your VM on machines with varying number of registers, and you pick a VM with R registers, you have to write code to make it run on machines with fewer than R registers and code to make it run on machines with more than R registers.

Both are simple if you’re willing to give up performance (map VM registers to memory for the first case, don’t use more than R registers for the second case), but that means giving away a lot of performance. So, you’ll essentially be working on two different optimizers, with each only affecting some of your target CPUs.

If, on the other hand, your VM has zero registers, you only have to write the code for running efficiently on machines with more than zero registers. That’s conceptually easier, and (loosely) means every improvement you make will improve things on all target CPUs.

Also note that static single assignment (SSA) as used in LLVM (not really a VM) somewhat takes the opposite approach: it assumes there are infinitely many registers, and then adds code to map that down to whatever is available on the target hardware.

Going back to the question: I think stack-based CPUs lost because all their instructions contend on access to the stack, making it hard for the hardware to pipeline instructions and/or execute them out of order.


As a hobbyist I agree that generating code for stack machines is much easier than for register machines. Because you get to ignore register allocation entirely! I hate register allocation (implementing it I mean).

But it's normally agreed to be less efficient. See Y Shi's 2005 paper Virtual Machine Showdown:

> We found that a register architecture requires an average of 47% fewer executed VM instructions, and that the resulting register code is 25% larger than the corresponding stack code. The increased cost of fetching more VM code due to larger code size involves only 1.07% extra real machine loads per VM instruction eliminated. On a Pentium 4 machine, the register machine required 32.3% less time to execute standard benchmarks if dispatch is performed using a C switch statement. Even if more efficient threaded dispatch is available (which requires labels as first class values), the reduction in running time is still around 26.5% for the register architecture.
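To get a rough feel for those numbers: the same statement a = b + c dispatches four instructions on a stack VM but only one on a register VM, while the register instruction has to spend bits encoding three operand indices (the mnemonics below are generic, not any particular VM's bytecode):

    # a = b + c in both styles (generic mnemonics, purely illustrative)
    stack_vm = [
        ("LOAD", "b"),    # push b
        ("LOAD", "c"),    # push c
        ("ADD",),         # pop two, push the sum
        ("STORE", "a"),   # pop into a
    ]
    register_vm = [
        ("ADD", "a", "b", "c"),   # one dispatch, but three operands to encode
    ]
    print(len(stack_vm), len(register_vm))   # 4 vs 1 instructions dispatched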


Is that the paper where they pop twice for a 2 -- 1 operation?


Can someone ELI5 (or at least ELI20): Why does Forth do better on a different CPU architecture than other languages? Or, why does Forth do worse on the architecture that most other languages do fine on?


Forth is a stack-based language, so it does really well on a stack-based CPU. Optimizers for common languages like C perform analysis on which values in a function are more commonly used, and want to control whether those values live in registers or RAM; register based CPU ISAs can give them more fine-grained control.

But if you're working in Forth, everything is stack ops anyway, you wouldn't benefit much from that kind of analysis, and really a stack-based CPU is going to give you the most speed because the most common operations boil down to a single instruction per.

Chuck Moore doesn't mind this because he writes literally everything in Forth. Most of us who work in other languages would mind the performance hit.


> Forth is a stack-based language, so it does really well on a stack-based CPU.

But IIRC so are the JVM (Java), CIL (C#), and Python's bytecode, so why did the OP single out Forth and omit those?..


And webassembly. I wonder if we'll see dedicated wasm accelerators at some point.


Is there a need, with WASM implementations generating native code? Jazelle tried that for the JVM on ARM, but I also get the impression that compilers made Jazelle unnecessary.


tbf on Moore's CPUs, stack ops boil down to a fractional instruction. (because there are multiple ops per architectural I-space word)


I think Burroughs mainframes also used stack architectures. I suspect the architecture wasn't entirely responsible for Burroughs' demise, but they seem to be an outlier.

I imagine one potential problem with stack architectures is running out of stack memory and having to spill to main memory. (Presumably using main memory for the stack would be too slow.) This seems unlikely for a single thread but I could imagine it being a big issue with thousands of threads. Context switches where you have to save the whole stack to memory are going to be more like cache misses or page faults.

Another potential issue might be dealing with data that isn't on the top of the stack. A solution could be to add a deep stack addressing mode but then you're basically mixing in a register machine and the simplicity and orthogonality might be lost.

The interesting thing to me is that several stack-oriented interpreters, compilers, and runtimes (including Forth, Pascal, Smalltalk, Java, and WebAssembly systems) have been developed, so maybe a stack-based CPU might be a good match for those systems.


The Burroughs mainframes stored most of their stacks in main memory - they have a couple of deep stack addressing modes (and a proper set of display registers).

Alternately, the 29Ks and SPARCs have register windows, which are essentially top-of-stack caches.

I think the RTXs were a thing at a particular point in time; they made sense before caches moved on-chip, and they didn't catch on because they expected you to buy into the Forth world-view (rather than providing C and the other languages of the day).


Stack-orientation makes sense for interpreters, because decode is expensive in software. Hardware has different tradeoffs.


The theory that stacks are better than registers for an interpreter was pretty conclusively disproven by the Lua 5.0 interpreter. See https://www.lua.org/doc/jucs05.pdf (summary: although decode is slightly more expensive per opcode, you have half as many opcodes to decode on average because you get rid of all the stack manipulation instructions)


Almost all problems have data dependencies as graphs rather than stacks. e.g.

  Sandwich = egg + bacon + bread
All of egg, bacon, and bread are almost always independent, and can be computed independently. A graph - a.k.a. a register architecture - enables this independence to be represented.

So register architectures are more complicated, but usefully so


> data dependencies as graphs rather than stacks...

> ...A graph a.k.a. a register architecture...

This is a new thought for me. What resources would you suggest to help me better understand how "register architecture" better supports modelling data as graphs?

* edited for better formatting


Maybe

https://en.m.wikipedia.org/wiki/Use-define_chain

SSA form might be a clearer expression of this idea: there, variable assignments more tightly conform to the idea of nodes in a graph, since each is assigned to only once, and so they produce a clear graph of the whole program.

If we just looked at registers (rather than SSA variables), then register reuse for different values would be more complex, since then the nodes of the graph would mean different things as the register is reassigned (assuming we were going for that representation)

Another way to think about it would be as relations. For instance, assembly ops like MULT and ADD can be seen as ternary relations, and as you join relations together into an expression, such as a+b*c, you are forming a graph.

This graph-like correspondence is used to good effect in BDDs and ZDDs, which use that correspondence to compress the representation of expressions.

Part of the reason behind this is that nodes of a graph are restricted in the same way registers are -- they are only referred to by name, never by an offset. Whereas for stacks, values are referred to by offset, never by name.

We could remove this separation by naming the top 16 (say) offsets of a stack. Then registers could be referenced by name and by an offset index

This would allow lots of fun things. At the software level, it could mean having normal named variables as well as #n offset variables. It might be a great addressing mode for assembly too. It is done for main memory with base+offset addressing, but not for registers.

The graph equivalent might be Register Direct. Or others, but at this point we're expanding to think beyond registers to memory itself

Another path you might like to explore could be the G-machine, or the Spineless Tagless G-machine, which (along with many other compilers) represents data as a graph, to ultimately convert it into registers.
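To make that concrete, here is a toy Python version of the graph view (the node layout is invented): a + b*c in SSA-ish form, where every value is defined once, so the def-use edges are the whole dependency graph.

    # a + b*c as SSA-ish nodes: each value defined exactly once, so
    # "who uses whom" is a plain graph. Purely illustrative structure.
    nodes = {
        "a":  ("input",),
        "b":  ("input",),
        "c":  ("input",),
        "t1": ("mul", "b", "c"),   # t1 = b * c
        "t2": ("add", "a", "t1"),  # t2 = a + t1
    }

    def uses(name):
        # Def-use chain: which nodes consume this value?
        return [n for n, (_, *args) in nodes.items() if name in args]

    print(uses("b"))    # ['t1']
    print(uses("t1"))   # ['t2'] - the only ordering constraint in the program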


Are you talking about parallel computation?

Because serially:

    def sandwich():
        egg()
        bacon()
        bread()
is pretty much where stacks shine.


Yes :) it wasn't the clearest example. But it is also slightly hard to express.

I think superscalar architecture for hardware benefits from an explicit graph of dependencies. Then equally the user benefits from reduced need to convert the stack to a graph with SWAP DUP etc -- which is one way to think of the role of those operators (as a stack to graph conversion)

On the other hand, some problems are unrepresentable by a graph and easily represented by a stack; for instance map, reduce, and a bunch of others.

What I'm suggesting is that both the hardware and the user and arguably even the problem, generally prefer graphs


This is just an economics question - registers are faster than a stack, for a similar price.

In the 80s, transistors on VLSI were expensive - and in rad-hardened designs, prohibitively expensive - and a stack gives you the same Turing completeness as an instruction set with many registers.

So for a while stack architectures performed well in microcomputers; then their area shrank to narrow special-purpose niches, and once many registers became affordable for everyone, stack architectures became history.


Building a solid computer stack (hardware through software) is an exercise in mechanical sympathy, understanding the traits of the underlying components which enable the practitioner to get as much out of it as possible. The software should do the thing that the software is good at, and the hardware should do the things that the hardware is good at, and neither exists in a vacuum, both should be aware of the constraints of the other.

For one example, about 25 years ago Intel made a processor called Itanium which used 'VLIW' (very long instruction word) instructions as opposed to x86. In this case, a lot of the complexity of operating a processor is shifted into the compiler, since individual CPU instructions may perform complex operations. In a modern processor, instructions are 'simple', and the processor is responsible for finding the data dependencies that allow multiple instructions to be executed in parallel (this is called 'superscalar' processing). With Itanium, the compiler author can encode such individual operations into more complex instructions which can be executed in parallel, all at once. So, where in x86 I might have in 3 different instructions (add r1, r2, sub r3, r4, mul r5, r6), with Itanium I might be able to combine all of those into a single instruction at compile time. By doing this, in theory we can remove a lot of the complex superscalar logic from the processor, devote more of the chip to execution units and achieve more overall performance.

One problem here is that the compiler may not actually know what parallelism is possible. For example, the latency of a given instruction may vary from chip to chip. The number of execution units for a single operation may vary. Our code is now very highly optimised for a particular CPU, we may not see natural speedups as newer processors are made, newer processors might not even support the code we've written without a translation layer. Taken to extremes, we lose the universality of software (the ability to change the hardware without rebuilding the software). VLIW chips are viable, but they're typically used in more isolated use cases (like DSPs) where it's reasonable to recompile the software for every chip (and where perhaps there is lots of parallelism in the problem space).

Stack machines essentially have similar issues. High performance processors tease out the data dependencies between instructions in order to perform superscalar processing, which is key to good performance in a modern processor. Stack machines make it very difficult to understand data dependencies because in principle all data depends on all other data; a post-processing step is required (which is considerably more expensive than the processor equivalent of converting to SSA form). Modern register machines rely heavily on register renaming to ensure that both superscalar processing and pipelining work well with each other - maintaining similar performance with a stack machine would lead to similar solutions, and so switching from a register machine to a stack machine would very likely look like having a translation layer on the chip, from the stack machine back into a register-like language.

Stacks do exist in a few places these days. The JVM clearly does it - Java is 'register oriented' for the programmer, the JVM instruction set is stack oriented, and the runtime converts it back into a register oriented machine code. Another example is the ARM Jazelle extensions which allow some ARM processors to directly execute Java bytecode. Here, the bytecode is converted into the machine code as an extra stage in the processor pipeline, adding overhead.

BLAB:

- There isn't much advantage to stack machines. It's relatively straightforward to translate between the two.

- Modern processors are architecturally closer to register machines, and a lot of the techniques used to make them fast would have to be replicated on a stack machine.

- Because any decisions at this level of the stack are effectively invisible to the user in modern tech, it's beneficial to go for the simplest solutions to any given problem, which here is to use a register machine at the hardware level.


The (disconcertingly silent) Mill CPU company have a design talk that covers these concepts wonderfully in the general form, while explaining their various innovations.

e.g. this video describes how their VLIW compiler does partial compilation/'specialisation' to allow the same input to be consumed on machines with different variations of hardware features.

https://youtu.be/D7GDTZ45TRw?t=259


The sun will have burnt out before the Mill CPU actually ships, sadly. But at least that takes care of some radiation concerns...


I'd expected them to at least produce an FPGA prototype or something.


It would have been interesting to see a superscalar ia64.

The biggest problem with ia64 is that virtually every generation shipped late. Very late. We had working first-generation CPUs years before anything shipped publicly. The first ia64 had a huge amount of die space wasted by a worthless Pentium III core (if my memory is correct), which would have been far better used on cache.

For some workloads ia64 was fast. Networking benchmarks on my development machine were impressive because of having gobs of memory bandwidth and fast interrupts. The asynchronous register window engine really helped some things. Sadly, once em64t took off, it was a dead architecture. x86 is the only platform that gets the real resources at Intel.


Ia64 was already superscalar. That's the whole point of the Explicitly Parallel part of EPIC.

Maybe you meant out of order?


I might be wrong but it seems a little suspect to call Ia64 superscalar because of that. When I hear superscalar, I generally expect to see a processor dynamically checking data dependencies at runtime, and perhaps executing multiple instructions depending on those data dependencies. With out of order execution added on top, I'd additionally expect to see a structure like a reorder buffer to allow results to be written to RAM in the same order as the logical execution.

I think that encoding the parallelism into the instructions would generally not be considered as superscalar, it's an entirely different technique. Most texts I've read would explicitly separate out VLIW and EPIC from superscalar processors. There are plenty of ways of getting parallelism - for example one could probably view SIMD or even some CISC instructions as a form of this - as far as I know, VLIW and superscalar processors are just considered different from this perspective.


> When I hear superscalar, I generally expect to see a processor dynamically checking data dependencies at runtime

No. They use brute force: they have as many state machines as their degree of superscalarity, and these SMs work simultaneously, saving results in an OoO buffer and periodically synchronizing via a special algorithm.

Each SM has a full set of shadow registers and the CPU state.

To be precise, there are examples of non-orthogonal or "partially" OoO machines - the Pentium MMX and IBM POWER. In them, the CPU core is divided into a few parts: the MMX parts divide up the register file, POWER has a separate pipeline for memory operations, and these parts can only work within their limited space, but the other mechanics are the same as in an ideal orthogonal OoO design.

For example, when a branch happens and it is unknown which side will run, and there are spare resources, the superscalar machine runs each side (in separate OoO pipelines); when the effective side is finally known, the results of the other side are simply discarded and the results of the effective side are merged with the other calculated data.

If a branch happens but resources are limited, the predictor kicks in; it learns from previous runs and can usually predict with >80% accuracy which side will run, so the most probable side is chosen.

When spare resources simply appear at runtime, it takes the last updated PC+1 and runs another SM in parallel from there; then at some point (mostly when buffers fill), all the data from the OoO execution is merged with the main flow.

All this means that a carefully crafted program could see the difference between OoO and non-OoO execution (not exactly in memory, but by measuring jitter, as merging the OoO pipelines takes some time), but if you don't care, you mostly get excellent compatibility, and if you're fortunate, OoO will run as many times faster as there are OoO pipelines, which can be significantly many - e.g. the Pentium 3/4 had 4 or more pipelines.

And as a side effect, this all puts high pressure on the cache subsystem, and it is sometimes optimized in unsafe (but cheap) ways, which is how vulnerabilities like Meltdown and Spectre happen.


https://en.m.wikipedia.org/wiki/Superscalar_processor

It very much is superscalar. OoO Superscalar is what you described.


I’m not sure it is. For comparison, Arm A8 is superscalar (can issue two non-conflicting instructions per cycle) whereas the A9 is out of order superscalar. I think the linked Wikipedia article supports me, no? My understanding is that a structure like a reorder buffer is what makes the processor OoO.

> The superscalar technique is traditionally associated with several identifying characteristics (within a given CPU):

> Instructions are issued from a sequential instruction stream

> The CPU dynamically checks for data dependencies between instructions at run time (versus software checking at compile time)

> The CPU can execute multiple instructions per clock cycle


That part is a bit weird, because checking data dependencies is precisely what OoO does. If that were a strict part of what superscalar means, then the OoO moniker would be redundant. If one doesn't execute things out of order, then checking for dependencies is useless.

However, above that is the simple definition, which is that as long as it executes more than a single instruction per clock it's superscalar. Even SIMD is taken as an example of a superscalar CPU.

As for the original claim, the ia64 is very much a superscalar CPU. All VLIW designs are superscalar. VLIW can also be thought of as an OoO superscalar CPU with the reorder buffer and dependency analyzer ripped off, exposing the execution units explicitly in their gory details. Or alternatively we can say that VLIW already won; we just added a big honking chip on top to JIT-compile machine code into VLIW micro-ops.


As everything, it is complicated.

Even in-order, non-superscalar CPUs need some kind of dynamic checking if they allow out-of-order completion by letting successive low-latency instructions execute in the shadow of a preceding high-latency instruction [1]. I'm not a CPU architect, but this tracking is much simpler than the register renaming of OoO and relies only on hardware interlocks.

I think the grandparent is right in distinguishing VLIW, especially exposed-pipeline designs, from superscalar, as they have no tracking at all and just naively issue bundles; I think it is a useful distinction.

EPIC is more complicated as while it allows expressing intrabundle parallelism, it also allow dependencies and so it does need hardware interlocks. You could argue either way.

I think SIMD by itself should not be considered superscalar, as it is still executing a single instruction in a single execution unit (compare the term superscalar itself to vector computation). Of course a superscalar CPU could have the capability to issue distinct SIMD instructions in parallel (for example Larrabee).

CPU microarchitecture details are fun!

[1] there are many examples, starting with the CDC 6600; quake was also famous for implementing texture perspective correction by scheduling a division every 16 pixels and using fast approximation for the remaining pixels.


> ia64 is very much a superscalar CPU

Yes, but ia64 was not a pure VLIW design; it was EPIC - VLIW with additions. So the compiler could suggest an optimal path of execution, but if no suggestion was provided, it used OoO (in reality things are more sophisticated, but at first glance, something like this).

Other widely known VLIWs are nearly all ultra-specialized pure designs, without any OoO, because EPIC is very expensive in terms of transistor count and the cost of a large die.

For example, the AMD TeraScale GPU architecture: the Radeon HD 2000 series, Radeon HD 3000 series, Radeon HD 4000 series, Radeon HD 5000 series, and Radeon HD 6900 series. https://en.wikipedia.org/wiki/TeraScale_(microarchitecture).


I forgot to type the OoO -- that was indeed what I meant. That's what I get for being tired and overworked doing storm repairs for the past 2 and a half weeks.


Full software scheduling breaks down on the first data cache miss. Combined with typically much larger icache footprints, VLIW just never lives up to the promise.

I agree they can be useful for DSPs, because the icache problem isn't as severe as in general-purpose software like a compiler or word processor.


I think it's a memory bandwidth thing - a shift in the ratio between on- and off-chip access penalties. One could imagine, though, an implementation with a lot of on-chip top-N-of-stack storage and renaming, but I'm not aware of one.


It's not memory bandwidth entirely. Modern superscalar Out-of-Order CPUs rename architectural registers with some number of physical registers to make larger OoO windows useful. To accomplish the same thing on a stack based architecture, you now have to rename memory addresses so that otherwise independent instructions can be reordered. The tags for registers are maybe 8-9 bits in a modern CPU, whereas full 64 bit tags would be needed for a stack based CPU (maybe you could get away with 48 bits on x86-64). That means more bits to compare in critical CPU blocks, which means more area and more power.

You could relieve that pressure by doing register windows (like SPARC or ia64) that aren't in "memory", and those architectures did have performance benefits for certain workloads (I remember working on i960 where interrupts were shockingly fast for the number of registers it had), but those architectures never really took any significant share of the mainstream market. Architecture matters far less in modern CPUs so long as it doesn't do anything that makes it "too hard" to build a good superscalar OoO core.

All that said, nobody has ever invested the kind of money put into x86 and ARM to make a stack based CPU go fast, so maybe the problems above are not insurmountable. But that's not the universe we happen to reside in.
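Coming back to the renaming point: the rename table itself is conceptually simple; the expensive part is doing the lookups and comparisons at full width and clock speed. A toy Python sketch (names and sizes invented) of how renaming removes the false dependency when an architectural register is reused:

    # Toy register renaming: each architectural write gets a fresh physical
    # register, so only true read-after-write dependencies remain.
    free_physical = iter(f"p{i}" for i in range(64))
    rename_table = {}                 # architectural name -> current physical reg

    def rename(dst, srcs):
        srcs_phys = [rename_table.get(s, s) for s in srcs]  # read current mapping
        new_dst = next(free_physical)                       # fresh name for the write
        rename_table[dst] = new_dst
        return new_dst, srcs_phys

    program = [("r1", ["a", "b"]),    # r1 = a op b
               ("r1", ["c", "d"]),    # architecturally reuses r1 (a WAW hazard)...
               ("r2", ["r1", "e"])]   # ...and this reads the *second* r1

    for dst, srcs in program:
        print(rename(dst, srcs))
    # ('p0', ['a', 'b'])
    # ('p1', ['c', 'd'])   <- gets its own physical reg, so no clash with p0
    # ('p2', ['p1', 'e'])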


You do not need to resolve/rename memory addresses. You need to resolve/rename stack addresses, which is an entirely different problem, much smaller than working with memory addresses. There are usually no more than 2 stack addresses to rename, even fewer than the number of registers.

If I remember correctly, the Elbrus 2 CPU had an external stack ISA and internally was a register-based OoO core. The Elbrus team was bought wholesale by Intel after the USSR's collapse, and one of the rumors behind the name of the Pentium chip line is that it was named after Pentkovskij: https://en.wikipedia.org/wiki/Vladimir_Pentkovski


That, and stack-based CPUs excel at code density, a metric which has not really been an issue for some time.


Modern machines are optimized to run C which is in fact essentially "stack based" but registers "cache" parts of the stack (often never touching the stack in leaf procedures).

There were at least two well-known machines that tried to mix the stack and registers - SPARC and my personal favorite, the Am29000.

Intel was so dominant in manufacturing for so long that the objectively kind of crappy x86 ISA just didn't matter for price/performance (and of course was a plus for running legacy software).


Mainly familiarity, I'd say. Realise that as early as ENIAC we were using accumulator-based ALU architectures, whereas the first Forth chips only started coming out in the eighties. It's hard to overcome that much momentum, even if the alternatives have some advantages.

Vaguely related: https://users.ece.cmu.edu/~koopman/stack_computers/index.htm...


40 years ago memory was faster than the CPU, so using the stack to hold operands incurred no speed penalty.

More important now is that registers are global resources, and global resources are a nightmare in multi-threaded code. Stack machines don't have global resources (except a stack pointer and ALU which are easy to manage).

But then CPUs and registers got fast and register machines began to predominate. Now you have to deal with the register allocation nightmare because registers are essentially global variables.

Java takes the approach of presenting a stack machine with no registers and then letting the interpreter figure out register allocations at run time (or nearly so).

Life would be simpler if we had 1000 simple cores, each with their own small private set of registers, and they communicated by message passing. I.E. the Actor model in hardware.
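A toy sketch of that in Python, with queues standing in for hardware message channels (everything here is invented purely for illustration):

    import threading, queue

    def core(my_id, inbox, outbox):
        # A tiny "core" with only private state, talking purely by messages.
        local = 0                       # private state - nothing shared
        while True:
            msg = inbox.get()
            if msg is None:             # shutdown signal
                break
            local += msg
            outbox.put((my_id, local))  # report by message, not shared memory

    results = queue.Queue()
    inboxes = [queue.Queue() for _ in range(4)]
    cores = [threading.Thread(target=core, args=(i, inboxes[i], results))
             for i in range(4)]
    for c in cores:
        c.start()
    for i, inbox in enumerate(inboxes):
        inbox.put(i + 1)                # send each core one message...
        inbox.put(None)                 # ...then tell it to stop
    for c in cores:
        c.join()
    while not results.empty():
        print(results.get())            # e.g. (0, 1), (1, 2), (2, 3), (3, 4)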


The thing is, that exact approach was tried in the 80s with the transputer (https://en.wikipedia.org/wiki/Transputer) and it's a very compelling computation model, as you've pointed out.

However, what was discovered was that programming the machines and the analysis required to produce code that communicated that way was hard, with specialist languages like occam (https://en.wikipedia.org/wiki/Occam_(programming_language) being required, and all sorts of scaling issues appearing.

As a student during this period, I learnt occam and wrote stuff for various transputer development boards.

The basic principles live on in the XMOS processors, which have found a niche in signal processing applications (https://en.wikipedia.org/wiki/XMOS), although there may be other applications I'm unaware of that have also adopted the idea.


Isn't this where Mr Moore's (of Forth fame) CPU designs have gone? (see https://www.greenarraychips.com).

Ok, maybe not the 'actor' model per se (unless actor channels are thought of as core interconnects). I think there is also some global state, but I can't remember much of these details at the moment without re-watching the linked Strange Loop vid.

https://www.youtube.com/watch?v=0PclgBd6_Zs&t=2s or https://news.ycombinator.com/item?id=22021262 is probably the best for grokking it.


I'm still surprised the GreenArrays Forth CPU matrix hasn't had a comeback yet.

Cores could send and receive Forth words/subprograms from their neighbors, and they had a bit of memory... it's like a unified, high-level, metaprogrammable memory-processor that only consumes minute amounts of power (and is async, IIRC). It really tickles my brain in wild ways.

https://duckduckgo.com/?q=ga144+cpu&t=ffab&ia=web


Isolate the cache (no MESI) and we have that now, with dozens to hundreds of cores.

Frequency scaling remains dead, so we’ll get all those cores, slowly but surely.


> Life would be simpler if we had 1000 simple cores, each with their own small private set of registers, and they communicated by message passing. I.E. the Actor model in hardware.

Not for the developers. Also good luck making that secure in hardware.

The big fat core model is always easier to program for


I agree so much with:

> Life would be simpler if we had 1000 simple cores, each with their own small private set of registers, and they communicated by message passing. I.E. the Actor model in hardware.

But more cores.

(And the message passing and routing implemented in hardware of course.)


Much as I love Forth and stack-based architectures, I have to admit that they are a very niche technology. To understand why they never really took off you have to put a pile of computers beside a pile of cars.

The internal combustion engine is terrible. Noisy, inefficient, difficult to make, and somehow everything else we've tried is worse. More-or-less every car you've ever driven has been powered by a four-stroke four-cylinder engine. It might have had overhead valves or an overhead cam, it might have had a carburettor or electronic fuel injection, but it's still the same basic technology going back over 100 years. Why?

Because it sucks, but everything else sucks worse. Two-strokes are simple but fiddly to get running over a wide range of power settings, Wankels are light and simple and don't vibrate but are incredibly fiddly to make.

Likewise in computing, it turns out that you can do all sorts of clever tricks like hardware support for object-orientated memory like the Linn Rekursiv (the last known example of which is on the bottom of the Forth and Clyde Canal, in the north of Glasgow, which ought to tell you how successful a concept it was), or hardware stack-based processors like the RTX2010 and its predecessors or the HP3000 minicomputer, or CISC processors that took the first C to strange new places like the DEC VAX architecture that was planned to have single machine code instructions for solving polynomials and multiplying matrices. However, you almost never use these, and - if you actually look at what the compiler emits - you find you spend all your time shuffling values between registers and firing them into the ALU.

And that's where RISC comes in.

You need to move things from memory to registers (masses of registers) and back, or fire it into the ALU and get a result back quickly. Combine all these with conditionals based on the value of the status register, and you've got a very powerful machine instruction set that lends itself well to general-purpose computing. Sure, you need to write quite a lot of code to do your matrix multiply now, but that's okay because it's written in lots of small efficient steps, running on a small efficient hardware platform.

So why would you implement a stack machine these days, and how? Well, the obvious use for something like Forth is in bringing up a machine from bare metal, where with a couple of kB of assembly code (just enough to set up the hardware you want to use, and implement a couple of dozen instructions) and a couple of kB of Forth code (Forth is mostly written in Forth) you can have a working environment that you can actually write code in and control your new computer. Some CPUs are particularly well-suited to this, having enough memory index registers (or like the 6809 and friends, two stack pointers) to implement the parameter and return stack, and if you've got registers left over you can do things like top-of-stack-in-register saving a memory access or two. The trick with "Forth written in Forth" is that once you've got the inner interpreter running all you need to do to write words in Forth from your assembler is define each word's header and then a list of addresses of the words you want to call.
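For flavour, here is a toy threaded-code inner interpreter in Python (nothing like a production Forth, and the word set here is invented, but it shows how "a list of addresses of words" becomes executable):

    # Toy inner interpreter, Forth-ish in spirit only. Primitives are Python
    # functions; colon definitions are lists of word names, standing in for
    # the list of addresses in a real Forth dictionary entry.
    stack = []

    def _add(): stack.append(stack.pop() + stack.pop())
    def _mul(): stack.append(stack.pop() * stack.pop())
    def _dup(): stack.append(stack[-1])
    def _dot(): print(stack.pop())

    dictionary = {
        "+":      _add,
        "*":      _mul,
        "DUP":    _dup,
        ".":      _dot,
        "SQUARE": ["DUP", "*"],        # : SQUARE  DUP * ;
    }

    def execute(word):
        # Primitives run directly; colon definitions are walked word by word.
        entry = dictionary[word]
        if callable(entry):
            entry()
        else:
            for w in entry:
                execute(w)

    stack.append(7)      # 7 SQUARE .
    execute("SQUARE")
    execute(".")         # prints 49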

If you wanted to implement a stack machine in hardware, like if you were heavily into homebrew computing or paleocomputing, you could probably do it in a few dozen common discrete logic chips, a couple of 74LS181 ALUs (you can still buy these!) and a couple of EPROMs for the microcode.


> Two-strokes are simple but fiddly to get running over a wide range of power settings, Wankels are light and simple and don't vibrate but are incredibly fiddly to make.

eeeh, not exactly. Modern EFI and computer control would probably make two-strokes just fine but you're burning oil so it wouldn't pass any emission standards.

The Wankel has... exactly the same problem, on top of maintenance issues and fuel usage issues. They are not "fiddly to make", but fiddly to take care of, which means a random don't-give-a-shit user might cause premature failure, which also gets them a bad reliability reputation.


Well, it's a bit of a simplification. Anyway, it's a damn sight easier to make a four-stroke with a broad range of useful power than it is a two-stroke (and a Wankel is just a special case of two-stroke), without getting into stuff like YPVS.

They're fiddly to make in that they require really weirdass machining techniques and interesting materials for the seals, which are not yet something we've worked out how to do.


Wankel apex seals are still not a completely solved problem.



