
> Stack machines that run some RPN form like Forth or Java code have been built, but don't go superscalar well.

I've been interested in Forth (and related stack) processors for a while, and my armchair observations over the past few months suggest that the (much-vaunted) performance gains of such processor designs are not straightforward to realize or take advantage of in practice.

I remember reading (unfortunately I'm not sure where right now) that the GA144 was built at a time when 18-bit memory was the trending novelty, and that it's not really an ideal processor design. I'm still fascinated by it, though (the sore lack of on-chip memory notwithstanding).

What sort of scale are you referring to when you say "superscalar"? 144 processors? 1000? Do stack-based architectures remain a not-especially-practical-or-competitive novelty, or are they worth pursuing outside of CompSci?

(FWIW, everything else you've written is equally interesting, but slightly over my current experience level)



Pure Forth machines were interesting when the CPU clock and the memory ran at about the same speed, and the number of gates was very limited. Chuck Moore's original Forth machine had a main memory, a data stack memory, and a return stack memory, each of which was cycled on each clock. It took only about 4K gates to make a CPU that way.
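
Roughly, as a toy sketch in C (my own model, not Moore's actual hardware; the struct, sizes, and function name are made up for illustration), the point is that the three memories are physically separate, so all of them can be touched in the same clock:

    /* Toy model of a three-memory Forth machine: main memory, data
       stack and return stack are separate RAMs, so all three can be
       cycled in the same clock with no contention. */
    #include <stdint.h>

    typedef struct {
        uint16_t main_mem[4096];    /* program and data           */
        uint16_t data_stack[16];    /* separate on-chip stack RAM */
        uint16_t return_stack[16];  /* separate on-chip stack RAM */
        uint8_t  dsp, rsp;          /* stack pointers             */
        uint16_t pc;                /* program counter            */
    } forth_cpu;

    /* One clock tick executing '+': the ALU touches only the data
       stack RAM, while the next instruction fetch reads only main
       memory, so both happen in the same cycle without wait states. */
    static void tick_add(forth_cpu *cpu) {
        uint16_t a = cpu->data_stack[--cpu->dsp];
        uint16_t b = cpu->data_stack[--cpu->dsp];
        cpu->data_stack[cpu->dsp++] = (uint16_t)(a + b);
        cpu->pc++;   /* fetch from main_mem overlaps in hardware */
    }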

Today, the speed ratio between CPU and main memory is several orders of magnitude. The main problem in CPU design today is dealing with that huge mismatch. That's why there are so many levels of cache now, and vast amounts of cleverness and complexity to try to make memory accesses fewer, but wider.

The next step is massive parallelism. GPUs are today's best examples. Dedicated deep learning chips should be available soon, if they're not already. That problem can be implemented as a small inner loop made massively parallel.
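
e.g. (just a sketch; the function name and shapes are illustrative) the core of a dense neural-network layer is a tiny multiply-accumulate loop, and every output element is independent, so the work spreads trivially across thousands of lanes:

    /* The "small inner loop": each output of a dense layer is an
       independent dot product, so the outer loop parallelizes across
       as many lanes as the hardware offers. */
    void dense_layer(const float *w, const float *x, float *y,
                     int n_out, int n_in) {
        for (int o = 0; o < n_out; o++) {        /* independent iterations */
            float acc = 0.0f;
            for (int i = 0; i < n_in; i++)       /* the tiny inner loop */
                acc += w[o * n_in + i] * x[i];   /* multiply-accumulate */
            y[o] = acc;
        }
    }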


> the speed ratio between CPU and main memory is several orders of magnitude

What if you compare scratchpad SRAM and an energy-efficient CPU?


Compare in what way? e.g. the L1 cache is SRAM running at core speed with low latency (4 cycles for Haswell).


A factor of 4 instead of orders of magnitude means a Forth machine might still be worthwhile. No?


Where's the Forth machine getting its operands from?

Sure, if you constrain your programs to use a tiny area of memory you might be able to achieve theoretical speed, but what workloads can you achieve with that?

If you were to write a browser in Forth, presumably it would still have to store all its DOM in DRAM?


You probably want to parallelize the browser for energy efficiency. Then you could distribute the DOM across the scratchpads of hundreds of cores, maybe?


So each core has its own JS engine, or when you iterate across the DOM with a selector you have to query across all the nodes? This doesn't sound great.

(The "pile of cores with scratchpads" exists e.g. Tilera and Netronome, and they're a right pain to program for)


Superscalar is not about the number of processors, but about a single core's ability to run multiple instructions in parallel. The most well known example of this is SIMD.


> to run multiple instructions in parallel.

> The most well known example of this is SIMD.

SIMD == Single Instruction, Multiple Data. Correct explanation, unsuitable illustration.


SIMD is an example of superscalar design. It is running multiple operations in parallel; they all just happen to be the same instruction. I could have said MIMD, but that is not a term that is well known.


I wouldn't call SIMD superscalar. The complexity of a superscalar design lies in tracking multiple in-flight instructions, their dependencies, and their out-of-order completion [1]. Classical SIMD machines run every lane in lockstep.

[1] not OoO issue, that would be a proper Out Of Order CPU.


It's a classic example of how much the definition matters. Though, if we define superscalar to mean that the instructions can run independently (i.e. not in lockstep), then I agree that a single SIMD unit is not superscalar. But a design with 2 SIMD units that operate on 2 different data streams independently would be a superscalar design.


I'm no expert, but I think my definition is what historically has been considered superscalar.

Yes, the design you described is definitely superscalar, but the fact that the two streams are SIMD is incidental.


I cannot call myself an expert, but I do have some experience in the domain. The classic example of ILP in a superscalar design is:

   a = b + c
   d = e + f
   g = a + d
Here the calculations of a and d are executed in parallel.

Wikipedia is not entirely clear either, but the entire page gives the impression that a superscalar design should send instructions to multiple execution units in parallel, in which case SIMD would be a single execution unit.


> Here the calculations of a and d are executed in parallel.

Why not, then, try to reason from the assembler/machine-code standpoint?

The parallel calculation above could be done in different ways:

a) the compiler would emit two "scalar" ADD instructions following one another (allocating registers so that independent execution is possible).

b) the compiler would coalesce both additions b+c and e+f into one vector operation (let's assume the data layout makes such an optimization useful) and emit only one "vector" (SIMD) instruction.

In case a), the two scalar instructions would be fetched "sequentially" by the prefetch unit but executed in parallel, in two separate instances of the adder ==> "superscalar".

In the other case, the vector operation would not be paired with another instruction that is part of the computation you mentioned above.
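
A rough C sketch of the two cases (function names and the SSE intrinsics are just for illustration, assuming an x86 target):

    #include <immintrin.h>

    /* Case a): two independent scalar adds. A superscalar core can
       issue both in the same cycle on two separate ALUs; only the
       final add has to wait for both. */
    float ilp_scalar(float b, float c, float e, float f) {
        float a = b + c;        /* ADD #1                    */
        float d = e + f;        /* ADD #2, independent of #1 */
        return a + d;           /* g depends on both         */
    }

    /* Case b): the two additions coalesced into one SIMD instruction,
       assuming the data layout lets us pack (b, e) and (c, f) into one
       register each; a single ADDPS computes a and d together in one
       execution unit. */
    float ilp_vector(float b, float c, float e, float f) {
        __m128 lhs = _mm_set_ps(0.0f, 0.0f, e, b);   /* [b, e, 0, 0] */
        __m128 rhs = _mm_set_ps(0.0f, 0.0f, f, c);   /* [c, f, 0, 0] */
        __m128 sum = _mm_add_ps(lhs, rhs);           /* [a, d, 0, 0] */
        float out[4];
        _mm_storeu_ps(out, sum);
        return out[0] + out[1];                      /* g = a + d    */
    }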

EDIT: corrected typo in operands of the coalesced additions.



