
RISC Is Unscalable - signa11
https://blackhole12.com/blog/risc-is-fundamentally-unscalable/
======
dgacmu
The article is conflating three separate issues:

(1) The amount of compute and/or data motion that can be achieved with a
single instruction. This is really about amortizing the cost of decode and
allowing the pipeline to be kept full by producing a lot of work from a single
instruction.

(2) efficiency gains from vector processing. This both amortizes the cost of
decode and reduces the amount of control circuitry relative to the number of
ALUs --> more flops/area. It also generally favors larger sequential memory
accesses, which is good for bandwidth (a quick sketch follows these three
points).

(3) extracting parallelism from the instruction stream. The VLIW debate is
about whether that should be done by the CPU itself, e.g., in the form of out-
of-order execution, or whether it should be handled by the compiler. VLIW
allows the compiler to do this work, which keeps the CPU simpler.
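
As a rough illustration of (1) and (2) (my own sketch, not from the comment
above; assumes an AVX2-capable x86 CPU and a flag like -mavx2):

    /* One decoded vector instruction does eight scalar adds' worth of work,
       amortizing fetch/decode and control overhead over eight ALU lanes. */
    #include <immintrin.h>

    void add_arrays(float *dst, const float *a, const float *b, int n) {
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);               /* 8 floats, one load */
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb)); /* 8 adds, one instruction */
        }
        for (; i < n; i++)                                    /* scalar tail */
            dst[i] = a[i] + b[i];
    }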

It's been clear for several years that larger vectors are a win, and that's
been happening in the Intel and ARM space, not to mention GPUs. The VLIW debate
is less clear, and has been going back and forth. I think that we have been
doing a better job of getting a handle on when instruction complexity and
diversity is beneficial versus not - remember that the initial RISC proposal
was in contrast to the PDP instruction set, which was kind of ridiculously
over specialized for the technology of the time.

~~~
simias
>remember that the initial RISC proposal was in contrast to the PDP
instruction set, which was kind of ridiculously over specialized for the
technology of the time.

That's a good point and I expect that it still somewhat holds true today: many
of the RISC supporters (and I'm one of them) effectively assimilate CISC with
x86 since that's by far the most popular CISC instruction set out there these
days[1]. And x86 is the C++ of instruction sets: decades and decades of legacy
features which shouldn't be used in modern code, and a large selection of
paradigms copied from whatever other instruction set was popular at the time.
You want to do BCD? We have opcodes for that (admittedly no longer supported
in long mode, but they're there). Also, do you like prefixes? Asking for a
friend.

But obviously that's a bit of a fallacy: one could design a modern CISC-y ISA
without all the legacy baggage of x86. It's not so much that I love MIPS or
RISC-V, it's that I don't want to have to deal with x86 anymore.

[1] We don't yet consider ARM CISC, right?

~~~
Taniwha
Actually it was in contrast to the VAX instruction set ... the PDP instruction
set is something else (and some VAXes had a PDP-11 emulation mode ...)

~~~
monocasa
VAX was a crazy-town ISA too, with stuff like single-instruction polynomial
evaluation.

~~~
dragontamer
Crazy for its age, but all modern processors (x86, ARM, and Power9) implement
a polynomial multiplication.

ARM: vmull.p8

Intel: PCLMULQDQ

Power9: I forget, but trust me, it's in there somewhere.

IIRC, they're all 64-bit carryless (or polynomial) multiplications. The
funniest part is that ARM and Power9 still claim to be "reduced instruction
set computers", despite having highly specialized instructions like these.
Power9 is the funniest: implementing hardware-level decimal-floats (for those
bankers running COBOL), quad-floats (for the scientific community),
Polynomial-Multiplication and more.
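
For concreteness, the x86 one looks roughly like this (my own sketch using the
PCLMULQDQ intrinsic; assumes a CPU with the PCLMUL feature and something like
-mpclmul):

    /* Carryless (polynomial) multiply of two 64-bit values over GF(2);
       the up-to-127-bit product comes back in an XMM register. */
    #include <immintrin.h>
    #include <stdint.h>

    __m128i clmul64(uint64_t a, uint64_t b) {
        __m128i va = _mm_set_epi64x(0, (long long)a);
        __m128i vb = _mm_set_epi64x(0, (long long)b);
        /* imm8 = 0x00 selects the low 64-bit half of each operand */
        return _mm_clmulepi64_si128(va, vb, 0x00);
    }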

~~~
floatboth
Well, the meaning of "reduced" has shifted from "not having specialized
instructions" to "load-store and preferably fixed-length".

Here's a CRC32 impl using POWER8/9 polynomial instructions:
[https://github.com/antonblanchard/crc32-vpmsum](https://github.com/antonblanchard/crc32-vpmsum)
— coincidentally, you don't have to do this on ARM, ARM just has CRC32
instructions.

> hardware-level decimal-floats (for those bankers running COBOL)

IBM being IBM :) I've heard somewhere that they are sharing some internals
between POWER and z/Architecture (mainframe) chips. No idea how true it is and
how much is shared if true, but a scenario like "decimal floats were
implemented because mainframes are using the same backend with a different
frontend, so we exposed them in the POWER frontend too" sounds quite
plausible.

~~~
CalChris
_Well, the meaning of "reduced" has shifted from "not having specialized
instructions" to "load-store and preferably fixed-length"._

Ain't that the truth. _Reduced_ never meant anything; they never defined it.
Patterson said they cooked up the name on the way to a grant presentation. I
think they should call it _Warmed Over Cray._ What was new in RISC that wasn't
in Cray? It was like the Geometry Engine; it was _possible_ for an academic
department to accomplish but that doesn't mean they accomplished anything that
hadn't already been done.

 _preferably fixed-length_

Or fixed-length/2.

------
byefruit
This is a relatively misinformed piece of writing.

There is little to no reason why high-performance RISC-V implementations
cannot achieve performance comparable to modern "CISC" cores.

As other commenters have said, VLIW doesn't suddenly make the problem go away
and no amount of compiler magic is going to fix the fundamental issue that
memory access on a modern processor is highly non-uniform and scheduling is
not something that can be done statically ahead of time.

RISC-V is designed to be extended with complex instructions and accelerators
and that's the scalable part.

~~~
bem94
I think it depends on how you interpret "reduced":

- Does it mean reducing the actual number of instructions in the ISA?

- Does it mean reducing the functionality of each instruction to its absolute
minimum?

These are similar, and overlapping in places. You can also do one without the
other.

I think that the base RISC-V ISA does _both_ to a perhaps unhelpful degree. As
soon as you want a RISC-V core to be competitive with other peer ISAs, you
need a bunch of the extensions, which minimises the value in calling it
reduced in the first place. At the least, it exposes a possible dichotomy
between "reduced-ness" and "usefulness".

~~~
nickik
RISC-V, even with the basic extension set, is still far smaller than the
competition. And when you add the 'V' extension it is also considerably
smaller than comparable SIMD instruction sets.

A RISC-V core with comparable features will always have fewer instructions.

~~~
floatboth
Indeed, but that just means the RISC-V designers have pursued the "fewer
instructions" goal for its own sake. Having fewer instructions is not
inherently beneficial. It can actually be detrimental: common operations
require more instructions.

[https://gist.github.com/erincandescent/8a10eeeea1918ee4f9d99...](https://gist.github.com/erincandescent/8a10eeeea1918ee4f9d9982f7618ef68)

~~~
nickik
The RISC-V designers disagree. A number of typical patterns can be optimized
easily by macro-op fusion in higher-performance cores. The benefit is a
decoder with an incredibly small size, minimizing the size of a minimal
RISC-V core.

Also, RISC-V prefers to keep less in the core spec because, if you have an
application that really needs something, there will be standard extensions to
add it.
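
To make the fusion argument concrete (my own illustration, not from the spec;
the exact fused pairs are microarchitecture-dependent):

    /* An indexed load like a[i] compiles to a short RV64 sequence that a
       wide core can recognize and fuse into one internal operation. */
    long load_elem(const long *a, long i) {
        return a[i];
        /* RV64 output (roughly), with a0 = a and a1 = i:
             slli t0, a1, 3     # scale the index by sizeof(long)
             add  t0, a0, t0    # compute &a[i]
             ld   a0, 0(t0)     # load a[i]
           A fusing decoder can treat slli+add (or the whole triple) as one op. */
    }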

~~~
floatboth
> minimizing the size of a minimal RISC-V core

Again: why should I care about _that_? As a user, I care about big desktop-
class cores, not academic minimal cores that fit on FPGAs. A small decoder is
not a benefit for real-world usage.

> there will be standard extensions to add it

That is, there will be fragmentation.

~~~
nickik
> Again: why should I care about that?

Have you considered that the world doesn't revolve around you?

You are in fact wrong: many 'real world' users like how small these cores can
be.

And given that many SoCs now have lots and lots of small cores in them, having
those cores be as small as possible is beneficial.

> That is, there will be fragmentation.

Yes. There will be fragmentation, because the industry is so broad - spanning
everything from minimal soft cores to massive HPC systems - that a true
one-size-fits-all design would have been doomed from the beginning and was
never a viable goal.

RISC-V is designed to approach the problem of a universal open-source ISA. It
accepts that avoiding fragmentation is impossible and instead tries to make
fragmentation manageable, both in the organisation of the standard and in the
tools.

------
teton_ferb
The author fundamentally misunderstands what RISC is, what CISC is, what SIMD
is, and what VLIW is, apart from misunderstanding every other major computer
architecture concept.

RISC was invented as an alternative approach in an era when processors had
really complex instructions, with the idea that high-level languages could be
efficiently compiled to them and that assembly programmers would be efficient
if they could do many things with one instruction. The RISC philosophy was to
make simpler instructions, let compilers figure out how to map high-level
languages to those simple instructions, and thereby fit the processor on one
die (yes, "processors" used to be several chips) and run it at high clock
speeds. RISC is not a dogma, it is a design philosophy.

On top of that exception handling in complex instructions is hard.
Implementing complex instructions in hardware consumes considerable design and
validation effort. RISC has won for these reasons.

Some things have changed: we can fit really complicated processors on a single
die, and memory access is the bottleneck. The downside of RISC in this reality
is well known: it takes many more instructions to do the same thing, which
means the instruction cache is used inefficiently (anyone remember ARM's Thumb
instruction set?). It might be useful to add application-specific hardware
acceleration features, because we now have the transistors to do it. How does
that make RISC unscalable?

Many CISC machines (e.g. Intel's) are CISC in name only. The instructions are
translated to micro-ops in the hardware. The micro-ops, and the hardware
itself, are RISC.

Register renaming was invented to ensure that we enjoy the benefits of
improvements in hardware without having to recompile. Let us assume you have a
processor with 16 registers and you compiled software for it. Now we can put
in 32. What do you do: recompile everything out there, or implement register
renaming?

VLIW failed because it took the stance that if we remove hardware-based
scheduling, the extra transistors can be used for computation and cache, and
scheduling can be done by compilers anyway. The reason VLIWs weren't
successful is that if a load misses the cache, you wait, unlike superscalars,
which would have found other instructions to execute. On top of that, if you
had a 4-wide VLIW and then wanted to make an 8-wide one, you had to recompile.
And oh, the "rotating registers" in VLIW are a form of register renaming.

Poorly informed article.

~~~
adwn
> _Many CISC machines (eg Intel 's) are CISC in name only. The instructions
> are translated to micro-ops in the hardware. The micro-ops and the hardware
> itself, is RISC._

This is an oft-repeated but incorrect statement. Modern x86 CPUs perform
_macro-op fusion_ and _micro-op fusion_. As an example of the former, a
comparison and a jump instruction can be fused into a single micro-op [1],
which is decidedly non-RISC. As for the latter, some micro-ops perform a load
from memory and an arithmetic operation with the retrieved value – also very
non-RISCy.

Modern x86 CPUs are CISC above and below the surface.

[1] [https://en.wikichip.org/wiki/macro-
operation_fusion#x86](https://en.wikichip.org/wiki/macro-operation_fusion#x86)
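
A concrete example of the former (my own illustration, not from the wikichip
page): the compare-and-branch at the heart of almost every loop is the classic
fusion candidate.

    /* Both the bounds check and the sign test compile to roughly
       "cmp ...; jcc ..." pairs, which recent Intel/AMD cores typically
       fuse into single micro-ops. */
    int count_negative(const int *v, int n) {
        int count = 0;
        for (int i = 0; i < n; i++) {
            if (v[i] < 0)
                count++;
        }
        return count;
    }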

~~~
monocasa
Yeah, vertical microcode always looked pretty RISCy if you're not used to it.

And FWIW, I've heard that there are two distinct uOp formats inside a single
core these days for quite a few of the uArchs. There's the frontend's view,
which is concerned with amortizing decode costs (so wide, fixed-purpose
instructions that otherwise look pretty CISC), and the backend's uOps, which
are concerned with the work scheduled on functional units. A lot of the fusion
happens on the frontend, and a lot of the cracking happens on the backend; the
frontend tends to be two-address, and the backend three-address. So a
frontend's

    and rax, [rbx, addr]

is something like

    ld_agu t0, rbx, addr
    ld     t1, t0
    and    rax, rax, t1

on the backend.

------
dragontamer
1. CPUs are about minimizing latency. CPUs aren't designed to scale, they're
designed to execute your (presumably single-threaded) program as quickly as
possible. This means speculatively executing "if" statements and speculatively
predicting loops, renaming registers and more.

2. GPUs are the scalable architecture: The 2080 Ti has 4352+ SIMD cores (136
compute units). And NVidia can load 30+ threads per compute unit, so 130560+
conceptual threads (kinda like hyperthreading) can exist on a GPU executing at
once.

3. VLIW seems like a dead end. AMD GPUs gave up the VLIW instruction set back
in 2008. Instead, the SIMT AMD GCN or NVidia PTX model has been proven to be
easier to program, easier to schedule, and easier to scale. If you want high-
scaling, you should do SIMD or SIMT, not VLIW. Intel, RISC, ARM, and Power9
have all chosen SIMD for scaling (AVX, ARM-SVE, Power9 Vector Extensions,
NVidia PTX, and AMD GCN).

4. I think VLIW might have an opportunity for power-efficient compute.
Branch prediction and out-of-order execution in modern CPUs rely on
Tomasulo's algorithm + speculation, which feels like it wastes energy (in my
brain anyway). VLIW would bundle instructions together and require less
scheduler / reordering overhead. If a company pushed VLIW as a power-efficient
CPU design... I think I'd believe them. But VLIW just seems like it'd be too
unscalable compared to SIMD or SIMT.

5. Can we stop talking about RISC vs CISC? Today's debate is CPU (latency-
optimized) vs GPU (bandwidth-optimized). The most important point is both
latency-optimized and bandwidth-optimized machines are important. EDIT:
Deciding whether or not a particular algorithm (or program) is better in
latency-optimized computers vs bandwidth-optimized computers is the real
question.

~~~
atq2119
All good points, but:

> VLIW seems like a dead end.

Since you mentioned this in the same breath as GPUs, I feel I have to point
out that according to some reverse engineering paper, Nvidia's Turing is a
VLIW architecture. (I'm talking about the actual hardware here, not PTX.)

Presumably they have some reason for that that's unrelated to increasing IPC,
since AFAIK their GPUs aren't superscalar.

~~~
dragontamer
> Since you mentioned this in the same breath as GPUs, I feel I have to point
> out that according to some reverse engineering paper, Nvidia's Turing is a
> VLIW architecture. (I'm talking about the actual hardware here, not PTX.)

NVidia Volta / Turing has been reverse engineered here:
[https://arxiv.org/abs/1804.06826](https://arxiv.org/abs/1804.06826) . EDIT:
I'm talking about the actual hardware, the SASS assembly, not PTX.

It doesn't look like a VLIW architecture to me. Page 14 for the specific
instruction-set details. I realize this is mostly a matter of opinion, but...
those control-codes are very different from the VLIW that was implemented in
the Itanium.

> On Volta, a single 128-bit word contains one instruction together with
> the control information associated to that instruction.

So NVidia has a 128-bit instruction (16-bytes) with a LOT of control
information involved. The control information encodes read/write barriers, but
there is still one-instruction per... instruction.

The "core" of VLIW was to encode more than one instruction-per-bundle. Itanium
would encode maybe 3-instructions per bundle for example.

What NVidia is doing here is having the compiler figure out a bunch of
read/write/dependency barriers so that the GPU won't have to figure it out on
its own (I presume this increases power-efficiency). The only thing similar to
VLIW is that NVidia has a "complicated assembler" which needs to figure out
this information and encode it into the instruction stream. Otherwise, it is
clearly NOT a VLIW architecture.

> Presumably they have some reason for that that's unrelated to increasing
> IPC, since AFAIK their GPUs aren't superscalar.

NVidia Turing can execute floating-point instructions simultaneously with
integer-instructions. So they are now superscalar. The theory is that
floating-point units will be busy doing the typical GPU math. Integer-
instructions are primarily used for branching / looping constructs (and not
much else in graphics-heavy code), so they are usually independent operations.

------
cwzwarich
> (which, incidentally, also makes MILL immune to Spectre because it doesn’t
> need to speculate)

Mill still provides speculation, but it is exposed as part of the user-visible
architecture. Before Spectre was publicized, the Mill team proposed using
speculation in a way that would make Mill systems vulnerable to Spectre:

[https://news.ycombinator.com/item?id=16125519](https://news.ycombinator.com/item?id=16125519)

There is an obvious fix for this (avoid feeding speculated values into address
calculations), but they didn't say how much it costs in terms of performance.

~~~
thomasjames
Is Mill happening? Has there been silicon yet? I have been reading about it
for so long, but have not seen any peer reviewed work on it, silicon or soft
cores.

I would like to see more exotic architectures out there, but I think I speak
for many when I say that I am starting to question if the Mill architecture is
going to land in a major way.

~~~
zaarn
If it happens or not, the Mill team still produced some of the most
interesting talk videos on youtube that you can watch. I would highly
recommend anyone who hasn't watched the series to do so, it's very in depth
and interesting.

Personally I hope that Mill happens. Whether it will beat existing CPUs left
and right can then finally be answered.

------
BubRoss
He talks about 'the laws of physics' meaning that RISC can't scale, neglects
prefetching completely, and then talks about the vaporware Mill CPU as being
some sort of solution because it does 'deferred loads that take into account
memory latency'.

~~~
zaarn
That sounds a bit dismissive.

Deferred loads do have a massive advantage: your compiler knows best when
loads are needed, so it can, for example, run the load for the next array
value while still processing the current one, allowing the CPU to not stall
at all and run at highest efficiency (because the compiler can know
instruction latencies and count out when to start the load).

Prefetching is an optimization of caching; it still doesn't beat the speed of
signal in a silicon CPU, nor does it solve the scalability issues mentioned in
the article.
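
A rough source-level sketch of what that overlap looks like (my own
illustration; a scheduling compiler would normally do this transformation
itself):

    #include <stddef.h>

    /* Start the load for the next element while still computing with the
       current one, so the memory latency overlaps with useful work. */
    long process(const long *data, size_t n) {
        if (n == 0) return 0;
        long acc = 0;
        long cur = data[0];
        for (size_t i = 0; i + 1 < n; i++) {
            long next = data[i + 1];   /* issue the next load early */
            acc += cur * 3 + 1;        /* ...while working on the current value */
            cur = next;
        }
        return acc + cur * 3 + 1;
    }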

~~~
BubRoss
You have your facts almost completely backwards. Compilers don't know best,
and any company that has banked on that for general-purpose CPUs has been
burned hard. Your explanation of what a deferred load is actually reads more
like a description of prefetching.

The compiler also does not necessarily know instruction latencies since they
change from one CPU to the next.

Out-of-order execution already does the job of deferred loads. Loads can be
executed as soon as they are seen, and other instructions can be run later
when the memory they need has made it to the CPU. This is why Haswell already
had a 192-entry out-of-order buffer. OoO execution also schedules instructions
to be run on the multiple execution units, and ends up doing what compilers
were supposed to do with VLIW CPUs.

> "Prefetching is an optimization of caching, it still doesn't beat the speed
> of signal in a silicon CPU, nor does it solve the scalability issues
> mentioned in the article."

None of this is true. Prefetching looks at access patterns and pulls down
sections of memory before they are accessed. Caching is about saving what has
already been accessed. I'm not sure what you mean by 'beating the speed of
signal', but if you are talking about latency, that is exactly what
prefetching deals with: by the time memory is needed it is already over at
the CPU. The article talks about issues that are due to memory latency (which
many modern CPU features deal with in one way or another), and prefetching
directly confronts this. Instruction access that happens linearly can be
prefetched.

~~~
CalChris
There are SW prefetch instructions and HW prefetching engines. SW prefetch
instructions have largely been a bust. Linus is famous for ranting about them.

[https://yarchive.net/comp/linux/software_prefetching.html](https://yarchive.net/comp/linux/software_prefetching.html)

HW looks at access patterns (as you say) and does at least as good a job.
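
For reference, SW prefetching from C looks something like this (GCC/Clang
builtin; whether it ever beats the HW prefetcher is exactly what's in dispute
here):

    #include <stddef.h>

    /* Hint the cache a fixed distance ahead of the access stream; the
       distance (16 elements here) is a guess that needs per-machine tuning. */
    long sum_with_prefetch(const long *data, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&data[i + 16], 0 /* read */, 1 /* low locality */);
            sum += data[i];
        }
        return sum;
    }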

~~~
BubRoss
Yep, I've tried to use prefetch intrinsics multiple times and I've never been
able to beat the CPU and speed things up.

------
jasonhansel
Reminder: all current, performant CISC CPUs just internally (in microcode)
compile instructions down to smaller uOPs, which are then executed by a RISC-
like core. CISC chips are just RISC chips with fancier interfaces.

~~~
kllrnohj
CISC CPUs also take multiple instructions and combine them into larger uOPs
through macro-op fusion.

But whether or not the CPU's uOPs are RISC isn't really relevant here.
The article is talking about pressure on things like L1. The argument is that
CISC becomes almost a form of compression. If the CPU internally splits it
into multiple uOPs that's fine - you still got the L1 savings, and those uOPs
can potentially be more specialized. The CPU doesn't need to look ahead to see
if the intermediate calculation is kept or anything like that.

As in, it's overall more efficient to take a macro op and split it than take
micro ops directly.

------
kazinator
I don't agree with the proposition that vector instructions (SIMD) are
inherently non-RISC. RISC is about whether the "I" is a reduced instruction
set, not whether or not "D" is multiple.

------
nickik
This is nothing new, and unnecessarily combative.

RISC-V is designed to make pipelining very efficient, but there has always
been a limit. RISC-V just helps you get close to that limit with limited
complexity.

Beyond that, part of RISC-V will be the 'V' standard extension, which will
give you access to an advanced vector engine that is an improvement on many
of the ways we do SIMD now.

------
Azerb
The irony here is that most modern CISC designs break instructions into RISC-
like μOps. Moore's law also means you have more transistors for the same
area; now figure out how to use them creatively to increase performance.
Workloads are constantly evolving, and hardware evolves with them to make
those workloads fast.

~~~
api
CISC since around the turn of the millennium is basically a custom-tuned,
high-decode-speed data compression codec for RISC-like micro-ops. It's been a
very long time since anyone designed a CISC processor that actually ran (non-
trivial) CISC instructions directly in silicon.

The root of CISC's persistent dominance over true RISC instruction sets is
that memory bandwidth is _far_ lower than what would be needed to feed
micro-ops directly into the CPU. It makes sense to solve that by compressing
the instruction stream. RISC looks far better on paper in every other way if
you ignore memory bandwidth and latency issues.

That being said, I've wondered for many years whether a more conscious
realization of this might lead to a more interesting design. Maybe instead of
CISC we could have CRISC, a Compressed Reduced Instruction Set Computer?
Instead of CISC you'd have some kind of compression codec that defines macros
dynamically. I'm sure x64 and ARM64+cruft are nowhere near optimal compression
codecs for the underlying micro-op stream. If someone wants to steal that idea
and run with it, be my guest.

~~~
wvenable
The other advantage of CISC is that it acts like a higher-level API. Many
early RISC designs suffered because they were so low-level that early
implementation details (like wait states) had to be "emulated" in later
processors for compatibility.

For that reason alone, it might not be advantageous to compress a RISC stream
of instructions rather than higher-level instructions that break down into
micro-ops.

------
temac
Note that the post seems to be discussing the old-school RISC reasoning more
than what we consider "RISC" in the modern world. Now RISC is just a style of
ISA (and/or, more informally, a description of _some_ internal aspects of CPU
pipelines, regardless of the ISA, behind the decoder and register renaming -
and even then that's very far from all aspects). And by the way, just because
an instruction is called "Floating-point Javascript Convert to Signed fixed-
point", with "Javascript" in its name, does not disqualify it from appearing
in a "RISC" ISA (or _even_ an old-school RISC uarch). At all.

The (proven) modern high-performance microarchitecture for general-purpose
CPUs is pretty much the same for everybody nowadays (well, some details vary
of course, but the big picture is the same for everybody), so a RISC ISA is
not necessarily very interesting anymore, but also not necessarily a big
problem. However, if we look at things that can still have an impact on the
internals, I would prefer RMW atomic instructions to LL/SC most of the time
(arguably LL/SC is the "RISC" way to go). Hell, I would even sometimes want
posted atomics...

So back to the topic: if RISC is to be interpreted as microcode-like
programming, then yeah, the RISC approach failed for state-of-the-art high
performance in the long run, and it did so a very long time ago -- but did
anybody even think it would win? Not so sure. It was more a nice design point
for another era, one that only lasted a few years. Anyway, the term has become
heavily overloaded, and has produced crucial results, even if indirectly. And
I agree a lot with the opinion that a Skylake/Zen-like uarch is the way to go,
and that even VLIW is dubious (or already known as a failure, if we take
Itanium as the main example) -- I don't even think the Web or anything else
can save it, to be honest. But conflating CISC with the presence of an
instruction useful for Javascript is nowhere near the right interpretation of
the term "RISC", whichever one is chosen. I mean, at some point you absolutely
can have a wide variety of execution units without that disqualifying you
from having the main "RISC" aspects.

------
crb002
I'm surprised that, with FPGAs, there isn't a wave of new assembly languages.
Or that processors-in-memory haven't taken off.

Parallel prefix, vector addition, vector transpose, 32x32 dense matrix
multiply ...

~~~
jlokier
The basic reason is price, which appears to be down to commercial decisions,
rather than manufacturing cost.

FPGAs which are large and fast enough to make interesting CPUs are readily
available, but they are far too expensive, compared with buying an equivalent
CPU core. For perspective, the FPGAs you can rent on Amazon are listed at
around USD $20,000 per individual FPGA chip!

Much cheaper and smaller FPGAs exist, but they aren't generally cost-effective
and performance-competitive against an equivalent design using a CPU, even
with the advantages provided by custom assembly languages and hardware
extensions.

There are times when it's worth it. People do implement custom assembly
languages on FPGAs quite often, for custom applications that benefit from it.
I've done it, people I work with have done it, but each time in quite
specialised applications.

(CPU manufacturers use arrays of FPGAs to simulate their CPU designs too.)

Processors in memory is another thing entirely. These are actively being
worked on.

GPUs using HBM exist, where the HBM RAM is a stack of silicon dies laid
directly on top of the GPU die, with large numbers of vertical
interconnections. These behave similarly to processors-in-memory, because
there are a lot of processing units (in a GPU), and a lot of bandwith to reach
the memory from all of the units.

Some studies show diminishing returns from increasing the memory bandwidth
further, with the present GPU cores and techniques, so it's not entirely clear
that intermingling the CPU cores with the RAM cores on the same die would
bring much improvement.

There is a physical cost to mingling CPU and bulk RAM on the same die, which
is that optimal silicon elements are different for CPUs than for bulk RAM, so
manufacturing would be either more expensive, or make compromise elements.

~~~
floatboth
> HBM RAM is a stack of silicon dies laid directly on top of the GPU die

nitpick: actually they're off to the side, connected via the interposer:

[https://images.idgesg.net/images/article/2019/02/amd-vega-
ra...](https://images.idgesg.net/images/article/2019/02/amd-vega-radeon-vii-
hbm-100787435-large.jpg)

~~~
jlokier
Thanks!

------
jcranberry
I was under the impression that CISC processors have utilized instruction
pipelining for a long time already.

Here's what Wikipedia has to say about it:

 _The terms CISC and RISC have become less meaningful with the continued
evolution of both CISC and RISC designs and implementations. The first highly
(or tightly) pipelined x86 implementations, the 486 designs from Intel, AMD,
Cyrix, and IBM, supported every instruction that their predecessors did, but
achieved maximum efficiency only on a fairly simple x86 subset that was only a
little more than a typical RISC instruction set (i.e. without typical RISC
load-store limitations). The Intel P5 Pentium generation was a superscalar
version of these principles. However, modern x86 processors also (typically)
decode and split instructions into dynamic sequences of internally buffered
micro-operations, which not only helps execute a larger subset of instructions
in a pipelined (overlapping) fashion, but also facilitates more advanced
extraction of parallelism out of the code stream, for even higher
performance._ [1]

And Intel has an article on how instruction pipelining is done which covers
CISC designs as well.[2]

1.
[https://en.wikipedia.org/wiki/Complex_instruction_set_comput...](https://en.wikipedia.org/wiki/Complex_instruction_set_computer#CISC_and_RISC_terms)

2. [https://techdecoded.intel.io/resources/understanding-the-
ins...](https://techdecoded.intel.io/resources/understanding-the-instruction-
pipeline/)

~~~
Symmetry
When RISC first came out, nobody knew how to do pipelining with a CISC ISA.
But a half decade and many engineering-years of effort later, Intel was able
to bring out a pipelined x86 processor. It probably would have taken even
longer with an ISA CISCier than x86, which lacks indirect loads and stores.

------
leoc
> The problem is that a modern CPU is so fast that just accessing the L1 cache
> takes anywhere from 3-5 cycles.

Access to RAM on a MOS 6502 is about 2-6 cycles, isn't it? The total amount of
RAM in a 6502 system is usually roughly the size of a modern L1 cache as well.
It would be interesting if you could just program with an L1 exposed and
structured as a main memory rather than a cache ...

~~~
AtlasBarfed
With the embarrassment of riches in silicon real estate, I'm surprised there
aren't SoC chips with a hundred or more megs of on-chip RAM. If we have room
for 32 cores, then there have to be applications for 8 cores plus RAM on-chip.

The HDD from my old early-'90s 486 would almost fit in a modern L3 cache.

~~~
janoc
That RAM costs both in terms of space and power/heat. Especially high-speed
cache RAM. That would make the SoCs huge and expensive for no particularly
good reasons.

Also large caches need a lot of logic to ensure coherency between multiple
cores - with a small cache the probability of conflict isn't that big, so you
can afford to keep it simple. With a huge cache such conflicts between cores
would be pretty much assured and you would have to dedicate a lot of silicon
just for managing access.

------
fortran77
Just like the saying "never bet against JavaScript" has been proven true,
"never bet against CISC" is also true. Just when you think it's become way too
complicated, expensive, and inelegant, it keeps chugging along.

------
bullen
Might the solution be to give the programmer "manual" access to all levels of
memory, so that we can choose which memory to use when, and from which core?
For the same reason that the OS should not decide which core runs which
thread: you cannot progress unless you give people the power to improve
things.

In extremis this also means the mono-kernel has to go, but then we're talking
about a big change.

Anyway, I don't think the improvements are going to come from hardware or
software alone; we need to improve both simultaneously, which is complex and
often requires one (or at least very few) person(s) to do the job.

------
gumby
The value of RISC these days is that the ops are sufficiently small that
computer-assisted humans can write not a JVM but an x86-VM-code interpreter,
distributing execution across a number of functional units simultaneously.

Even that is super hard, given that the leading producer of hardware
interpreters for this VM shipped hardware with the Spectre vulnerability.

------
jasonhansel
...RISC-V supports SIMD though.

~~~
maeln
That's not the point of the article. It doesn't say that RISC-V doesn't
support SIMD; it argues that in the real world we prefer big instructions like
SIMD that do a lot of things "at once", which is not what RISC-V was designed
for, even though it supports some.

~~~
nickik
RISC-V from the beginning was designed to work perfectly fine with vector
processing. The lab that developed RISC-V made it to have an architecture
that would allow them to use different vector units and experiment with
different vector architectures. The Berkeley Parallel Computing Lab is where
they worked on RISC-V.

The RISC-V V extension is the result of this work and exploration into
parallel architectures.

RISC-V was designed to be modular, and the V extension is just as much part of
RISC-V as the F extension.

~~~
maeln
I honestly don't have any opinion about it, as I am not knowledgeable about
these things. I was just restating what the article says, and it didn't say
that RISC-V doesn't have SIMD.

------
ohiovr
What about the SGI Onyx?

[https://m.slashdot.org/story/12659](https://m.slashdot.org/story/12659)

Seems that MIPS was scalable back then.

There was a story headline about a 16-core RISC-V just the other day.

~~~
ajross
That's not the scalability the article is talking about. It's about
instructions per cycle, and the fact that this gets pretty firmly capped by
the bandwidth and latencies of the cache hierarchy, so even with ~6 parallel
execution units it's pretty rare to be able to fill more than 2 of them in a
cycle. And this is true, but it has absolutely nothing to do with instruction
architecture.

The rest of the article is talking about how VLIW "solves" this problem, which
it does not. Given an existing parallelizable problem, a VLIW architecture can
encode the operations a little more efficiently, and the decode hardware
required to interpret the instructions can be a lot simpler. But that's as far
as it goes. If you have a classic CPU which can decode 4 instructions in a
cycle, then it can decode 4 instructions in a cycle and VLIW isn't going to
improve that except by reducing die area.

VLIW also forces compilers to make a choice between crazy specificity to
specific hardware architectures, or an insanely complicated ISA with a
batching scheme a-la Itanium. This is largely why it failed, not that
"compilers weren't smart enough".

There's nothing wrong with RISC. Or classic CISC. Even VLIW isn't bad, really.
The sad truth is that ISAs just don't matter. They're a tiny bit of die area
and a few pipeline stages in much larger machines, and instruction decode is
largely a solved problem.

Programmers just like to yell about ISA because that's what software touches,
and we understand that part.

~~~
gpderetta
Amen.

------
dis-sys
from the article -

People still call ARM a “RISC” architecture despite ARMv8.3-A adding a FJCVTZS
instruction, which is “Floating-point Javascript Convert to Signed fixed-
point, rounding toward Zero”.

The fundamental issue that CPU architects run into is that the speed of light
isn’t getting any faster. Even getting an electrical signal from one end of a
CPU to the other now takes more than one cycle.

~~~
vardump
No idea why you've been modded down. Yeah, the ARM arch is not that RISC
anymore; there are tons of instructions, some pretty specific like that one. I
don't see much of a point in polarizing CISC vs RISC anymore in the first
place. Both are borrowing from each other and gradually blending together.

