

MuP21 – A High Performance MISC Processor (1995) - luu
http://www.ultratechnology.com/mup21.html

======
lgeek
I'll bite the bullet: Funnily enough, I think this sort of design fails to
account for Moore's law (different Moore, you see).

The principle here is to drop many of the principles that make modern
processors fast: pipelines, caches, large register files, rich instruction
sets (e.g. different instructions for different data sizes, DSP-style
instructions, SIMD, etc). You end up with a really simple core, easy to
implement, cheap in terms of die area. This might work well for certain
applications (maybe as a power efficient microcontroller), but you're left
with two major issues: 1) you have no other way to increase its performance
apart from increasing the clock frequency - we know that pretty much stalled;
and 2) you'll discover at some point that while hitting the cache is expensive
in terms of energy, it's still better than hitting the off-chip RAM, and since
you always have to go to the RAM eventually... On top of that, DRAM bandwidth
has more or less kept up with processor frequency, but latency hasn't.

At this point, you might think that this design would work well in a manycore
system. Well, the problem with that is that we haven't quite figured out the
algorithms and software for manycore systems. Pretty much anything apart from
data-parallel problems scales worse than linearly on manycore.
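Amdahl's law makes that concrete. A quick calculation, with a purely
illustrative assumed serial fraction (not a measurement of anything):

```java
public class Amdahl {
    public static void main(String[] args) {
        double serial = 0.05; // assume 5% of the work can't be parallelized
        for (int cores : new int[] {1, 16, 144}) {
            double speedup = 1.0 / (serial + (1 - serial) / cores);
            System.out.printf("%3d cores -> %.1fx speedup%n", cores, speedup);
        }
    }
}
```

With even that small serial fraction, 144 cores buy you only about an 18x
speedup on non-data-parallel work.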

That being said, there's one feature that I quite like: having an exposed
separate return stack could improve the performance of virtualisation (dynamic
translation), JIT engines and so on. Modern architectures have an internal RS
used for branch prediction, but it's usually not architecturally exposed.
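As a rough illustration (a hypothetical interpreter, not MuP21's actual
hardware), an architecturally exposed return stack is just a second stack
that the program, or a translator, can manipulate directly:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical dual-stack machine: return addresses live on their own
// stack, separate from operands, and are directly visible to software.
class DualStackMachine {
    final Deque<Integer> dataStack   = new ArrayDeque<>(); // operands
    final Deque<Integer> returnStack = new ArrayDeque<>(); // return addresses

    int pc; // program counter

    void call(int target) {
        returnStack.push(pc + 1); // a JIT/translator could read or patch this
        pc = target;
    }

    void ret() {
        pc = returnStack.pop();   // next pc comes straight off the return stack
    }
}
```

Because the return stack is architecturally visible, a dynamic translator can
remap return addresses into its code cache instead of reverse-engineering them
from a hidden predictor structure.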

tl;dr: Moore's law led to fast RISC/CISC, MISC can only scale horizontally
(many-core), which limits its usefulness.

~~~
willvarfar
It's well worth watching the talks about the new Mill processor design, as it
claims to tackle precisely this head-on.

And the videos make excellent watching:

http://ootbcomp.com/docs/

------
lambda
This was designed by Chuck Moore, the inventor of Forth, who's still doing
work in this space, except instead of a single very simple core like this,
he's making chips with 144 extremely simple cores.
http://www.greenarraychips.com/

~~~
unwind
Ah, that explains the almost non-existent explanation of the design being a
stack machine. :) I was a bit surprised when it said something along the lines
of "the processor has OVER but not SWAP"; those are very stack-centric
instructions.
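For anyone without a Forth background, here's a sketch of the two words in
question, modelled on a plain Java Deque (an illustration of the semantics
only; the real hardware keeps the top of stack in registers):

```java
import java.util.ArrayDeque;
import java.util.Deque;

class StackWords {
    // SWAP: exchange the top two items.  Stack effect: ( a b -- b a )
    // Both methods assume at least two items on the stack.
    static void swap(Deque<Integer> s) {
        int b = s.pop(), a = s.pop();
        s.push(b);
        s.push(a);
    }

    // OVER: copy the second item over the top.  Stack effect: ( a b -- a b a )
    static void over(Deque<Integer> s) {
        int b = s.pop(), a = s.peek();
        s.push(b);
        s.push(a);
    }
}
```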

Also, the "M" in MISC was never explained which irked me. Is it a pun, as in
"miscellaneous"? Or "minimal"? It's strange to propose a name that feels a lot
like a new acronym, but never spell it out.

~~~
Sanddancer
Minimal. He chose it because it only has 21 operations, as opposed to RISC
machines, which had 50 or so at the time.

------
cfallin
This is certainly an interesting read, and stack machines have a long history
among alternative CPU architectures. But I think that the complaints this
(admittedly 18-year-old) article levels against mainstream CPUs are largely
unfounded with today's chips. Or at least, increasing complexity is not a
cause of decreasing performance, as is claimed, but rather a major reason that
we've had such amazing leaps in computing power. Pipelining, branch
prediction, out-of-order execution, multilevel caching, and clever tricks in
many places all enable 3GHz CPUs that can peak at 4 instructions per cycle
today. You'll lose two orders of magnitude in performance if you try to throw
all that out and go back to basics, as the article suggests. (You may also cut
power by more than two orders of magnitude, and power is an interesting
constraint in today's systems; but the article didn't specify low power as the
goal.)

\- "Modern RISCs are slow because instructions take many cycles to execute":
Latency is irrelevant as long as the pipeline remains filled. Pipeline length
only matters when it's in the critical path, e.g. cycles from fetch to branch
execution determines the bubble size on a mispredicted branch.

\- "The pipeline must be flushed and refilled on branches"? Only on
mispredicts, which are rare with good predictors on most code.

\- "Cache makes the system more complex and expensive"? But it also is the
only way to have acceptable performance when DRAM accesses take 200 cycles or
more.

\- "RISCs are slow at calls and returns" due to many registers? A good
compiler will only save as many registers as it clobbers; more registers
reduce register pressure and spills/fills, improving perf. (This is a net perf
increase from x86-32 to x86-64, for example.) The only significant cost is
additional context-switch overhead.
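To put rough numbers on the mispredict and DRAM points (every figure below is
an illustrative assumption, not a measurement of any particular chip):

```java
// Back-of-the-envelope model: effective CPI =
// base CPI + mispredict penalty + memory stall cycles.
public class CpiSketch {
    public static void main(String[] args) {
        double baseCpi        = 0.25;  // 4-wide superscalar at peak
        double branchFrac     = 0.20;  // fraction of instructions that branch
        double mispredictRate = 0.02;  // good predictor on typical code
        double flushPenalty   = 15;    // pipeline refill cycles per mispredict

        double missRate    = 0.01;     // accesses that go all the way to DRAM
        double dramLatency = 200;      // cycles per DRAM access

        double cpi = baseCpi
                   + branchFrac * mispredictRate * flushPenalty
                   + missRate * dramLatency;
        System.out.printf("effective CPI ~ %.2f%n", cpi);
        // With these numbers: 0.25 + 0.06 + 2.0 ~ 2.31 -- the DRAM term,
        // not pipeline depth, dominates; hence caches.
    }
}
```

Even with generous assumptions, the memory term dominates, which is exactly
why removing the cache to simplify the core is a bad trade.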

~~~
cpleppert
It is very clear from the description of the F21 (a related design:
http://www.ultratechnology.com/f21cpu.html) that the intention behind the
design is to reduce the bandwidth required for fetching instructions; hence
the rather convoluted claim that the CPU runs at four times the speed of
memory, because it can fetch four five-bit instructions in a single
memory-word access.
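The packing itself is straightforward. A sketch of the decode, assuming (as
the article implies) a 20-bit word holding four 5-bit slots, executed from
the high bits first:

```java
public class SlotDecode {
    // Unpack four 5-bit instruction slots from one 20-bit memory word.
    static int[] unpack(int word20) {
        int[] ops = new int[4];
        for (int i = 0; i < 4; i++) {
            ops[i] = (word20 >> (15 - 5 * i)) & 0x1F; // slot 0 = high bits
        }
        return ops;
    }

    public static void main(String[] args) {
        // One fetch yields four opcodes, hence the "4x memory speed" claim.
        for (int op : unpack(0b10101_00010_00001_00000)) {
            System.out.println(op); // prints 21, 2, 1, 0
        }
    }
}
```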

This really doesn't make sense on its own, and when it's combined with
sentences like "Increasing speed in the RISC processor creates a large
disparity between the processor and the slower memory. To increase the memory
accessing speed, it is necessary to use cache memory to buffer instruction and
data streams. The cache memory brings in a whole set of problems which
complicate the system design and render the system more expensive." I'm
really quite at a loss.

The claim is that the processor is faster because it is simpler, so it has
higher instructions per clock and can avoid a large clock disparity between
CPU and RAM. Well, if your processor is faster, it doesn't matter whether the
speed comes from IPC or from a faster clock; memory will still be slow, so
this design philosophy simply doesn't make sense to me.

------
zvrba
What, no integer division, no multiplication, no floating point... Each of
these would take a _lot_ of instructions to emulate with the basic 21
instructions, so the CPU would need to run at something like 100 GHz to match
the performance of today's CPUs.
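For a sense of the cost, here is a textbook shift-and-add multiply built only
from the kind of primitives such a core offers (a sketch; the actual MuP21
instruction set differs in detail):

```java
public class SoftMultiply {
    // Shift-and-add multiply from add/shift/test primitives only.
    // Assumes non-negative operands; roughly 20 dependent iterations
    // for 20-bit values, versus a single multiply on a modern CPU.
    static int multiply(int a, int b) {
        int product = 0;
        while (b != 0) {
            if ((b & 1) != 0) product += a; // add partial product
            a <<= 1;                        // shift multiplicand left
            b >>>= 1;                       // consume one multiplier bit
        }
        return product;
    }

    public static void main(String[] args) {
        System.out.println(multiply(123, 456)); // prints 56088
    }
}
```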

Plus, since it's a stack machine, you'd lose all the goodies of superscalar,
OoO execution, etc. (there's constant contention on the top of the stack).

CISC has won (heck, even ARM is not RISC anymore). Deal with it :)

~~~
lgeek
> CISC has won (heck, even ARM is not RISC anymore). Deal with it :)

If by RISC you literally mean a reduced number of instructions, I guess that's
a valid point. But RISC is mostly used to describe instruction sets in which
each instruction performs a limited amount of work, especially memory
accesses. That's why RISC architectures are generally load/store, while CISC
architectures allow access to multiple discontinuous memory areas in a single
instruction (e.g. x86's movs reads from one address and writes to another in
one instruction).

By the way, modern x86 chips are essentially RISC machines internally: they
translate the x86 instruction set into simpler native micro-ops.

------
e12e
On a somewhat related note, I thought one way to play with a RISC-like
instruction set (on real hardware) would be to create a compiler that emitted
a reduced set of valid x86 assembly, a "Risc86" so to speak. Apparently I'm
not alone:

http://spivey.oriel.ox.ac.uk/corner/Risc86

Here, apparently used for teaching -- but I wonder if it would be easier to
implement a CPU that parsed a subset of x86, rather than a completely new CPU
(or "CPU" running on an FPGA) -- with the added bonus of being able to run
binaries both on the new chip and on "legacy" x86 machines...

Sounds fun, but I guess using OpenSparc or LEON would make more actual
sense...

------
gcb0
A JVM on top of one of those sounds like a sweet machine. Was this ever tried?

I know close to nothing about chip design or hardcore low-level code
optimization, but the stack style described here seems like a good match for
the JVM...

~~~
cfallin
I think that, while interesting, the semantic gap is probably too large. E.g. the
machine stores machine words on the stack, but the JVM stores Java values. And
the instruction set of the JVM is much richer, and it knows about many more
concepts (types, classes, objects, boxing/unboxing, ...). So you'll have a
significant translation layer anyway. If you have to pay that cost, then doing
register allocation onto a conventional set of registers is probably cheap in
comparison. And you want the conventional machine for its much, much higher
performance.
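To illustrate the gap (assuming standard javac behaviour; exact bytecode can
vary slightly between compilers):

```java
class Gap {
    static int add(int a, int b) {
        return a + b;
        // javac compiles this to roughly:
        //   iload_0   // push int from local slot 0
        //   iload_1   // push int from local slot 1
        //   iadd      // typed 32-bit add
        //   ireturn   // typed return
        // Typed loads, a verified frame layout, and class metadata are
        // all concepts a bare 21-instruction stack machine knows nothing
        // about, so a translation layer is unavoidable.
    }
}
```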

