
> Stack machines that run some RPN form like Forth or Java code have been built, but don't go superscalar well.

I've been interested in Forth (and related stack) processors for a while, and my armchair observations over the past few months suggest that the (much-vaunted) performance gains of such processor designs are not straightforward to realize or take advantage of in practice.

I remember reading (unfortunately I'm not sure where right now) that the GA144 was built at a time when 18-bit memory was the trending novelty, and that it's not really an ideal processor design. I'm still fascinated by it, though (the sore lack of on-chip memory notwithstanding).

What sort of scale are you referring to when you say "superscalar"? 144 processors? 1000? Do stack-based architectures remain a not-especially-practical-or-competitive novelty, or are they worth pursuing outside of CompSci?

(FWIW, everything else you've written is equally interesting, but slightly over my current experience level)



Pure Forth machines were interesting when the CPU clock and the memory ran at about the same speed, and the number of gates was very limited. Chuck Moore's original Forth machine had a main memory, a data stack memory, and a return stack memory, each of which was cycled on each clock. It took only about 4K gates to make a CPU that way.
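
Roughly, as a toy sketch in C (my own model, not Moore's actual hardware; the struct, sizes, and function name are made up for illustration), the point is that the three memories are physically separate, so all of them can be touched in the same clock:

    /* Toy model of a three-memory Forth machine: main memory, data
       stack and return stack are separate RAMs, so all three can be
       cycled in the same clock with no contention. */
    #include <stdint.h>

    typedef struct {
        uint16_t main_mem[4096];    /* program and data           */
        uint16_t data_stack[16];    /* separate on-chip stack RAM */
        uint16_t return_stack[16];  /* separate on-chip stack RAM */
        uint8_t  dsp, rsp;          /* stack pointers             */
        uint16_t pc;                /* program counter            */
    } forth_cpu;

    /* One clock tick executing '+': the ALU touches only the data
       stack RAM, while the next instruction fetch reads only main
       memory, so both happen in the same cycle without wait states. */
    static void tick_add(forth_cpu *cpu) {
        uint16_t a = cpu->data_stack[--cpu->dsp];
        uint16_t b = cpu->data_stack[--cpu->dsp];
        cpu->data_stack[cpu->dsp++] = (uint16_t)(a + b);
        cpu->pc++;   /* fetch from main_mem overlaps in hardware */
    }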

Today, the speed ratio between CPU and main memory is several orders of magnitude. The main problem in CPU design today is dealing with that huge mismatch. That's why there are so many levels of cache now, and vast amounts of cleverness and complexity to try to make memory accesses fewer, but wider.

The next step is massive parallelism. GPUs are today's best examples. Dedicated deep learning chips should be available soon, if they're not already. That problem can be implemented as a small inner loop made massively parallel.
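
e.g. (just a sketch; the function name and shapes are illustrative) the core of a dense neural-network layer is a tiny multiply-accumulate loop, and every output element is independent, so the work spreads trivially across thousands of lanes:

    /* The "small inner loop": each output of a dense layer is an
       independent dot product, so the outer loop parallelizes across
       as many lanes as the hardware offers. */
    void dense_layer(const float *w, const float *x, float *y,
                     int n_out, int n_in) {
        for (int o = 0; o < n_out; o++) {        /* independent iterations */
            float acc = 0.0f;
            for (int i = 0; i < n_in; i++)       /* the tiny inner loop */
                acc += w[o * n_in + i] * x[i];   /* multiply-accumulate */
            y[o] = acc;
        }
    }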


> the speed ratio between CPU and main memory is several orders of magnitude

What if you compare scratchpad SRAM and an energy-efficient CPU?


Compare in what way? e.g. the L1 cache is SRAM running at core speed with low latency (4 cycles for Haswell).


A factor of 4 instead of orders of magnitude means a Forth machine might still be worthwhile. No?


Where's the Forth machine getting its operands from?

Sure, if you constrain your programs to use a tiny area of memory you might be able to achieve theoretical speed, but what workloads can you achieve with that?

If you were to write a browser in Forth, presumably it would still have to store all its DOM in DRAM?


You probably want to parallelize the browser for energy efficiency. Then you could distribute the DOM across the scratchpads of hundreds of cores, maybe?


So each core has its own JS engine, or when you iterate across the DOM with a selector you have to query across all the nodes? This doesn't sound great.

(The "pile of cores with scratchpads" exists e.g. Tilera and Netronome, and they're a right pain to program for)


Superscalar is not about the number of processors, but about a single core's ability to run multiple instructions in parallel. The most well known example of this is SIMD.


> to run multiple instructions in parallel.

> The most well known example of this is SIMD.

SIMD == Single Instruction, Multiple Data. Correct explanation, unsuitable illustration.


SIMD is an example of superscalar design. It is running multiple operations in parallel; they all just happen to be the same instruction. I could have said MIMD, but that is not a term that is well known.


I wouldn't call SIMD superscalar. The complexity of a superscalar design lies in tracking multiple in-flight instructions, their dependencies, and their out-of-order completion [1]. Classical SIMD machines run every lane in lockstep.

[1] not OoO issue, that would be a proper Out Of Order CPU.


It's a classic example of how much the definition matters. Though, if we define superscalar to mean that the instructions can run independently (i.e. not in lockstep), then I agree that a single SIMD unit is not superscalar. But a design with 2 SIMD units that operate on 2 different data streams independently would be a superscalar design.


I'm no expert, but I think my definition is what historically has been considered superscalar.

Yes, the design you described is definitely superscalar, but the fact that the two streams are SIMD is incidental.


I cannot call myself an expert, but I do have some experience in the domain. The classic example of ILP in a superscalar design is:

   a = b + c
   d = e + f
   g = a + d
Here the calculations of a and d are executed in parallel.

Wikipedia is not entirely clear either, but the entire page gives the impression that a superscalar design should send instructions to multiple execution units in parallel, in which case SIMD would be a single execution unit.


> Here the calculations of a and d are executed in parallel.

Why not, then, try to reason from the assembler/machine-code standpoint?

The parallel calculation above could be done in different ways:

a) the compiler would emit two "scalar" ADD instructions following one another (allocating registers so that independent execution is possible).

b) the compiler would coalesce both additions b+c and e+f into one vector operation (let's assume the data layout makes such an optimization useful) and emit only one "vector" (SIMD) instruction.

In case a), the two scalar instructions would be fetched "sequentially" by the prefetch unit but executed in parallel, in two separate instances of the adder ==> "superscalar".

In the other case, the vector operation would not be paired with another instruction that is part of the computation you mentioned above.
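
A rough C sketch of the two cases (function names and the SSE intrinsics are just for illustration, assuming an x86 target):

    #include <immintrin.h>

    /* Case a): two independent scalar adds. A superscalar core can
       issue both in the same cycle on two separate ALUs; only the
       final add has to wait for both. */
    float ilp_scalar(float b, float c, float e, float f) {
        float a = b + c;        /* ADD #1                    */
        float d = e + f;        /* ADD #2, independent of #1 */
        return a + d;           /* g depends on both         */
    }

    /* Case b): the two additions coalesced into one SIMD instruction,
       assuming the data layout lets us pack (b, e) and (c, f) into one
       register each; a single ADDPS computes a and d together in one
       execution unit. */
    float ilp_vector(float b, float c, float e, float f) {
        __m128 lhs = _mm_set_ps(0.0f, 0.0f, e, b);   /* [b, e, 0, 0] */
        __m128 rhs = _mm_set_ps(0.0f, 0.0f, f, c);   /* [c, f, 0, 0] */
        __m128 sum = _mm_add_ps(lhs, rhs);           /* [a, d, 0, 0] */
        float out[4];
        _mm_storeu_ps(out, sum);
        return out[0] + out[1];                      /* g = a + d    */
    }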

EDIT: corrected typo in operands of the coalesced additions.



