
RISC instruction sets I have known and disliked - jsnell
http://blog.jwhitham.org/2016/02/risc-instruction-sets-i-have-known-and.html
======
russell
Historical nit: I always considered the CDC 6600 to be the first commercial
RISC machine, although given its strange architecture, I can see that others
might disagree. It had multiple floating point and integer processors. An
assembly programmer had to be aware of them all. I would not write two FP
divides in a row because the second would stall waiting for the first to
finish. I could write two consecutive FP multiplies, because there were two FP
multipliers. Instruction timings were always a consideration in selecting
registers, because you wouldn't want to try using a register that was the
target of another instruction until that instruction had completed.
Fortunately there were interlocks so that you would get the register contents
expected rather than some undefined intermediate state. You always had two or
three parallel instruction flows going to take advantage of as many of the 10
or so processors available.

Other aspects of the architecture were truly strange. There were no load or
store instructions. They were a side effect of setting an address register. I
was the lead developer for two of the PL/I compilers for the 6600. Much fun.
For those interested in strange architectures, I recommend the Wikipedia
article
[https://en.wikipedia.org/wiki/CDC_6600](https://en.wikipedia.org/wiki/CDC_6600).

~~~
Animats
That was Seymour Cray, and his Cray machines were even more RISC-like. The
Cray-1 was a very simple machine; it just had 64 of everything.

------
mpweiher
> RISC instruction sets like PowerPC are usually expected to be highly regular
> [..] whereas the CISC style of instruction set is expected to be highly
> irregular and full of oddities

No. Not true at all. Not even close. CISC instructions are expected to be
_complex_ , hence the name. MC68K was pretty regular, NS32032 highly regular,
both CISC.

RISC are _reduced_ , not _regular_ , so for example you'd expect memory access
to only occur with specific memory access instructions, whereas all arithmetic
and other computation only deals with registers.

So CISC was usually _more_ regular, not less. x86 is the exception, because it
was just extensions heaped on top of extensions: 8080 8 bit -> 8086 segmented
16 bit + 20 bit addresses -> 80286 protected 16 bit segmented with 24 bit
addresses -> 80386 semi-segmented/mostly flat 32/32 bit, etc.

~~~
nickpsecurity
That's a point I haven't seen before. Yes, many x86 haters who tend to like
RISC architectures typically praise the M68K for its ISA. Clearly there's
something else outside of RISC and CISC. You might have figured it out. Maybe
not. Worth thinking on.

------
Symmetry
CISC has always been much nicer to hand-write assembly code in than RISC is.
The only reason RISC was able to take off was that most code started to be
generated by compilers rather than written by hand in assemblers.

Given the advances in modern computer architecture that the author talks
about, many of the old advantages of RISC no longer apply. If you're going to
be doing out-of-order execution, then the extra effort in implementing some
extra instructions really isn't important for application processors.

The big advantage that RISC has these days is that fixed-width instructions
are easy on the decoder. You can also have variable-width instructions that
use UTF-8-esque byte marking to make things easier on the decoder, but x86
doesn't have anything like that. But then again, separating them entirely in
the ISA makes things easier on the designers.

Oh, and it's a bad idea to touch memory multiple times in a single instruction
on a modern machine, but Intel's optimization manuals warn you not to do that
and compilers abide by those warnings. If you want an ISA feature that's
really hard to design into a high-performance uArch, there's indirect
addressing, but unlike most CISC ISAs x86 managed to avoid that one.

The advantages of RISC might be overblown in some sense, but there have been a
lot of new instruction set architectures developed over the last 20 years when
the RISC/CISC debates were raging. Many of those have been weird in various
ways but almost all look a lot more like RISC instruction sets than CISC
instruction sets and it's not just because people are following the herd.

And I've got the sense that my inside view of the issue is underestimating how
advantageous RISCishness is. When ARM had the opportunity to redesign their
ISA when they transitioned to 64 bits, they simplified their ISA quite a bit
and increased the number of registers from 16 to 32, basically making their
ISA much more similar to a classical RISC design. I don't really understand
why they thought that going that way was an advantage, but it seems like
people actually involved with designing these things, instead of just thinking
about them in their armchairs, still think that RISC has a lot of advantages.

~~~
yokohummer7
> The big advantage that RISC has these days is that fixed width instructions
> are easy on the decoder.

One thing that I've wondered is how much more effort is needed to decode
variable-width instructions. Decoding itself sounds fairly easy (but frankly I
don't know any details), to the point that the amount of time needed for
loading/storing/calculating overwhelms that of decoding. But decoding has to
happen extremely fast to fill the pipeline, so the speed might still matter.
Can decoding instructions be an actual bottleneck?

~~~
dbcurtis
OH, my, yes, it can become a bottleneck. Disclaimer: It has been a good many
years since I was privy to the innards of an X86.

In the X86, it is possible for an instruction to be from 1 to 15 bytes long.
(Maybe more today? It was 15 when I cared.) All you can tell from looking at
the first byte is that it is either one byte or longer than one byte. All you can
tell from the 2nd byte is that it is either 2 bytes or longer than 2 bytes,
and so on. When you walk all the way out to the 15th byte, you might find a
MOD/RM field, which may contain invalid combinations. _Finally_ you have
enough information to raise (or not) the illegal instruction exception. That
is one very nasty equation.
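To make that concrete, here's a toy Python length-walker for a tiny, made-up subset of x86. The prefix set and opcode table are simplified for illustration (real decoding also has prefix-dependent operand sizes, ModRM/SIB bytes, and displacements); the point is just that each byte only tells you whether to keep going:

```python
# Hypothetical toy: walk the bytes one at a time to find instruction length.
PREFIXES = {0x66, 0x67, 0xF0, 0xF2, 0xF3,        # size overrides, LOCK, REP
            0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65}  # segment overrides

# bytes that follow the opcode byte, for a few one-byte opcodes
TAIL = {0x90: 0,   # NOP            - nothing follows
        0xC3: 0,   # RET
        0xB8: 4,   # MOV EAX, imm32 - 4 immediate bytes follow
        0xE9: 4}   # JMP rel32

def insn_length(buf):
    i = 0
    while buf[i] in PREFIXES:  # a prefix byte only says "not done yet"
        i += 1
    if buf[i] not in TAIL:
        raise ValueError("opcode outside toy subset")
    return i + 1 + TAIL[buf[i]]
```

Even in this toy, nothing short of walking every byte tells you the length; the real thing also has to handle ModRM/SIB and invalid combinations at the very end.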

Just one example of how variable instructions can become annoying to a logic
designer. OTOH, some machines are very regular in how instruction length is
specified -- in IBM 370 code, for instance, you can look at the first 2 bits
and know the instruction width. X86 is an example of organic accumulation of
features over time leading to a large collection of special cases.
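For contrast, the S/370 rule fits in a few lines of Python: the top two bits of the first opcode byte fully determine the length:

```python
def s370_length(opcode_byte):
    # bits 0-1 of the opcode byte: 00 -> 2 bytes (RR format),
    # 01/10 -> 4 bytes (RX, RS/SI formats), 11 -> 6 bytes (SS format)
    return {0: 2, 1: 4, 2: 4, 3: 6}[opcode_byte >> 6]

s370_length(0x1A)  # AR  (RR) -> 2
s370_length(0x5A)  # A   (RX) -> 4
s370_length(0xD2)  # MVC (SS) -> 6
```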

~~~
userbinator
The majority are below 4 bytes though, and ModRMs are either the 2nd or 3rd
(in case of 0F escape or other prefix) byte. The 15-byte limit still applies,
and is very rarely approached. As I understand it, modern x86 decoders can
handle (multiple of) the smaller instructions in one cycle, while longer ones
take a cycle or two more.

~~~
Animats
Intel and AMD approach this differently. Intel decodes a few instructions
ahead of execution, and sometimes decodes speculatively. AMD at one time was
expanding an entire cache line to fixed length instructions and executing the
decoded form.

X86 allows you to store into code, even immediately ahead of execution. This
made sense in the 1970s when Harry Pyle designed the instruction set and CPUs
were slower than memory. Superscalar CPUs have to support this. But, since
almost nobody does that any more, they don't do so efficiently. Storing into
code near execution causes an exception event, flushing all the superscalar
lookahead and backing up to just before the instruction doing the store into
code. Then the code gets modified, and the pipeline reloads, having lost tens
to hundreds of cycles.

------
0x0
The IA64 (Itanium) series on Old New Thing seems to highlight a very crazy
instruction set and architecture:

[https://blogs.msdn.microsoft.com/oldnewthing/20040119-00/?p=...](https://blogs.msdn.microsoft.com/oldnewthing/20040119-00/?p=41003)

[https://blogs.msdn.microsoft.com/oldnewthing/20150805-00/?p=...](https://blogs.msdn.microsoft.com/oldnewthing/20150805-00/?p=91171)

------
KMag
It's unfortunate he didn't touch more on the DEC Alpha AXP. It was designed
from the start to be a 64-bit chip, unlike most 64-bit ISAs in use today.

Sure, it took them a bit to be convinced that single-byte loads and stores
were worthy of dedicated instructions.

It was designed so that most of your kernel code wasn't actually running in a
privileged CPU mode, but instead made upcalls to PAL code, a sort of super
lightweight hypervisor that emulated however many rings of protection the
kernel needed. (Ultrix and Linux needed 2 rings. VMS ran on another set of
firmware that emulated more rings.)

It was a nice clean design that was running at 500 MHz back when Intel could
manage 200 MHz. Its memory model was more friendly to parallel execution (and
less friendly to compiler and JIT writers) than the x86 memory model, forcing
weaker consistency guarantees out of the JVM memory model as a result.

It seems a shame to me that the architecture was never revived. I'd like to
hear more about its quirks and flaws.

------
bazizbaziz
Kinda seems like the compiler just shouldn't allocate r0 for inline assembly
on PPC, since it's only valid in special circumstances. Hard to fault the ISA
a lot since this is basically the compiler backend author(s) missing a corner
case, which is quite easy to do considering the breadth of a compiler backend.

~~~
msbarnett
Yeah, that's not about the ISA, that's just Evidence That GCC's Inline ASM
Functionality is a Mess #938292721

See also: [http://free-electrons.com/blog/how-we-found-that-the-
linux-n...](http://free-electrons.com/blog/how-we-found-that-the-linux-
nios2-memset-implementation-had-a-bug/)

See also: [http://robertoconcerto.blogspot.ca/2013/03/my-hardest-
bug.ht...](http://robertoconcerto.blogspot.ca/2013/03/my-hardest-bug.html)

~~~
comex
Alternately it could be seen as evidence that PowerPC assembly _syntax_ is a
mess. For anyone who doesn't know, the way it works with typical PowerPC
assemblers is that instructions take unadorned numbers for all arguments, and
determine whether they refer to registers or immediates based on the
instruction: "li 1, 2" sets R1 to the immediate 2 ("load immediate"), while
"mr 1, 2" sets R1 to the value of R2 ("move register"). And then because
people find bare numbers confusing, you have includes that do "#define r1 1"
or equivalent for each register, so when writing assembly manually you can
write "mr r1, r2". But because these are just dumb macros, nothing stops you
from writing "li r1, r2" - the assembler will just macro expand r2 to 2 and
treat it as an immediate!
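The failure mode is easy to reproduce with a few lines of Python standing in for the assembler's dumb macro pass (a sketch, not a real assembler):

```python
# "#define rN N" for every register, as the PowerPC asm headers do
DEFINES = {f"r{i}": str(i) for i in range(32)}

def macro_expand(line):
    mnemonic, *operands = line.replace(",", " ").split()
    # blind text substitution: registers and immediates look identical after it
    return mnemonic + " " + ", ".join(DEFINES.get(op, op) for op in operands)

macro_expand("mr r1, r2")  # "mr 1, 2" - fine, both slots are registers
macro_expand("li r1, r2")  # "li 1, 2" - r2 silently became the immediate 2!
```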

Other architectures have the R prefix as an intrinsic part of the syntax, so
if you write R2 in a slot where the instruction requires an immediate, you'll
just get an error. If PowerPC did that, you'd still need to remember the right
inline assembly constraint letter for GCC, but getting it wrong would 'just'
result in an unpredictable compile error when the compiler decided to use r0,
not silent misbehavior.

~~~
mpe
At least with GNU as you can use %r1, %r2 etc. as an "intrinsic part of the
syntax". Which means you can't use a register name where an immediate is
expected.

However that doesn't fix the gotcha with r0 being special, that is specified
in the ISA. In fact it's that way precisely so you can load an immediate
without needing a separate opcode.

~~~
comex
Huh, never knew that... but I just tried it and GAS (the version Debian
installed as powerpc-linux-gnu-as, at any rate) accepted "lwz %r0, %r5(%r0)".
Snatching defeat from the jaws of victory...

It would still be a gotcha, but a pretty minor one if messing it up just
resulted in an error. I suppose the approach taken by AArch64 and others is
preferable, where one register is just completely reserved as constant 0
rather than only in some encodings.

------
Ericson2314
IMO the success of CPU lines in recent years has had almost nothing to do with
the intrinsic properties of the instruction sets.

I don't blame MIPS for doing delayed branching etc., because _if_ most machine
code is compiled from something else, and _if_ most compilers use decent
abstractions, one should be changing ISAs all the time to adapt to and bolster
the latest and greatest implementation techniques. (e.g. for out-of-order
superscalar, it's probably best to give the CPU some sort of dependency
graph.)

The focus on hand-coding as a way to get to know the architecture, on the
other hand, borderline insinuates that instruction sets should optimize for
hand-coding, which is just plainly ridiculous.

~~~
userbinator
On the other hand, no one wants to have to recompile everything all the time,
which gives much force to the argument that CPUs should be more CISC, so that
the same (complex) instructions will simply run faster due to hardware
improvements. REP MOVS on x86 is a great example of this; it was originally
the fastest way to do a block copy until around the Pentium when it lost (only
slightly) to very large custom unrolled loops, but since ~P6 it has been
internally optimised to copy whole cache lines at once and in the very latest
microarchitectures it is once again the fastest.

~~~
Ericson2314
Well, as a NixOS user (which granted way post-dates RISC), I get all the
benefits of constant recompilation without burning any of my own CPU cycles.

Very good point on the `REP MOVS` front (and cool story!). Indeed, if the Mill
pans out it would instantly usher in a renaissance for branch delays.

So yeah, I'd rather recompile than try to predict future architecture trends,
but either way, the grossness should be there for performance, not hand-coding
ease.

------
aidenn0
A few comments:

Yes, MIPS assembly is terrible, though I hear that the newer ISA revisions are
better (the one I worked with most heavily was the 5k, and systems programming
on it was very unpleasant).

Power and MIPS also both now have Thumb-2-alike instruction encodings, POWER
VLE and MIPS16e, respectively. It turns out that code size matters.

Lastly, RISC no longer means what it used to. It is basically used today to
just mean a load-store architecture, as you now have variable-length
instructions, multi-cycle arithmetic instructions, and out-of-order superscalar
chips labeled "RISC".

When IBM came out with the POWER ISA, I seem to recall that one of the authors
of the abacus book claimed it wasn't simple enough to be RISC, which is a
quaint thought these days.

~~~
Symmetry
It's sort of funny that ARM, which is really the CISCiest of the old RISC
instruction sets, has ended up being the most successful one.

[http://userpages.umbc.edu/~vijay/mashey.on.risc.html](http://userpages.umbc.edu/~vijay/mashey.on.risc.html)

~~~
renox
> ARM, which is really the CISCiest of the old RISC instruction sets

You can replace 'is' by 'was' since ARMv8.

------
twic
_However, many of the later RISC architectures do share one annoying flaw.
[...] the mechanism for storing 32-bit immediates can only encode a 32-bit
value by splitting it across two instructions: a "load high" followed by an
"add". [...] This design pattern turns up on almost all RISC architectures,
though ARM does it differently (large immediates are accessed by PC-relative
loads)._

He neglects to mention the ARM's clever approach to this:

[https://alisdair.mcdiarmid.org/arm-immediate-value-
encoding/](https://alisdair.mcdiarmid.org/arm-immediate-value-encoding/)

~~~
to3m
The ARM approach is clever, but you only load 8 bits. Suppose you need a
32-bit constant - it's going to take 4 instructions. Compared to the halfword
instructions, you do win with certain awkward constants such as 3<<15. That's
why neither approach is as good as having variable-length instructions ;)

This made me wonder how common such constants are. So I grabbed some code I've
been working on recently, for which I happened to have assembly language
output, and searched for every immediate constant. (This must be the first
time I've found a good use for gcc's nasty AT&T x64 syntax.)

My code targets x64, so take the "analysis" with a pinch of salt. Out of the
8982 instructions that had immediate operands, there were 819 unique 32-bit
constants. 762 (93%) were high-halfword or low-halfword only, so they could
have been loaded with one halfword instruction. By comparison, only 392 (48%)
could have been loaded with one instruction on ARM.

Ten constants were better for ARM, in that they would take two halfword
instructions to form, but only one MOV or MVN: ['0x00ffffff', '0x03ffffff',
'0x0fffffff', '0x3fffffff', '0x7fffffe8', '0x7ffffffe', '0x7fffffff',
'0x80000003', '0xfffffffe', '0xffffffff']. These ten constants were used by 84
instructions out of the 8982.
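The checks behind those numbers can be sketched in Python. This tests MOV encodability only; MVN would also admit the bitwise complement of an encodable value:

```python
def arm_mov_encodable(v):
    """Is v a classic ARM operand2 immediate: an 8-bit value rotated
    right by an even amount? Rotating v *left* by r undoes a ror-by-r."""
    for r in range(0, 32, 2):
        if (((v << r) | (v >> (32 - r))) & 0xFFFFFFFF) < 256:
            return True
    return False

def one_halfword(v):
    # loadable with a single low-halfword or high-halfword instruction
    return (v & 0xFFFF) == 0 or (v >> 16) == 0

arm_mov_encodable(3 << 15)  # True: it's 6 rotated right by 18
one_halfword(3 << 15)       # False: both halfwords are non-zero
```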

~~~
pm215
Modern ARM has load halfword insns too, so you can use those or the 8-bit imm
encoding depending on the constant.

For a full 32 bit value prior to movw/movt you'd most likely load it from a
constant pool rather than do a 4 insn sequence. (Some 32 bit values can be
done with clever choice of 8imm sequences -- there's an algorithm you can use
as a compiler to say "given this value can I create it in 3 or less insns?",
which is worth the effort if you're targeting a pre-movw ARM cpu.)

------
chrismonsanto
> x86 is not particularly nasty

x86's terrible reputation is well deserved. x86-64 fixes a number of problems,
but for most of x86's lifetime we've had to live with 8 registers and the lack
of IP-relative addressing...

Let's not forget how awful x87 floating point was, either!

~~~
viraptor
I'm confused about why IP-relative addressing would be useful. Have you got
some interesting examples?

~~~
gsg
It's the basis of efficient relocations in position independent code, which is
now very common.

PIC can be emitted on older x86 machines without RIP-relative addressing, but
the code is larger and slower. As an example, consider -m32 gcc output for the
C program

    int x;

    int getx() {
        return x;
    }

With no PIC:

    getx:
        movl    x, %eax
        ret

With PIC:

    getx:
        call    __x86.get_pc_thunk.cx
        addl    $_GLOBAL_OFFSET_TABLE_, %ecx
        movl    x@GOT(%ecx), %eax
        movl    (%eax), %eax
        ret

And with -m64, which emits PIC and uses x86-64's RIP-relative addressing:

    getx:
        movl    x(%rip), %eax
        ret

Hopefully that makes the motivation clear.

~~~
viraptor
Thanks, I've seen it so many times it seems I developed (%rip) blindness :) Of
course it's useful this way.

------
Dylan16807
> At this point, it's probably better to have an efficient instruction
> encoding, save on memory bandwidth and instruction cache space, and have a
> comprehensible instruction set. Hence x86.

x86 is full of legacy single-byte instructions and complicated prefixes,
hurting space efficiency and making it a huge pain to have high-throughput
decoding. You could do a lot better if you took the x86 instruction list and
reassigned all the encodings.

------
Zardoz84
> However, many of the later RISC architectures do share one annoying flaw.
> Immediate values are constants embedded within the instructions. Sometimes
> these are used for small values within expressions, but often they're used
> for addresses, which are 32-bit or 64-bit values. On PowerPC, as on SPARC
> and MIPS, the mechanism for storing 32-bit immediates can only encode a
> 32-bit value by splitting it across two instructions: a "load high" followed
> by an "add". This is a pain. Sometimes the two instructions containing the
> value are some distance apart. Often you have to decode the address by hand,
> because the disassembler can't automatically recognise that it is an
> address. This design pattern turns up on almost all RISC architectures,
> though ARM does it differently (large immediates are accessed by PC-relative
> loads). When I worked on an object code analyser for another sort of RISC
> machine, I gave up on the idea of statically resolving the target of a call
> instructions, because the target address was split across two instructions,
> one of which could appear anywhere in the surrounding code.

> The x86 system for storing 32-bit/64-bit immediates is much nicer. They just
> follow the instruction, which is possible because the instruction length is
> variable. Variable-length instructions are not usually seen in RISC, the
> Thumb-2 instruction set being the only exception that I know of.

A hybrid way that could be the best of both worlds:
[https://github.com/trillek-team/trillek-
computer/blob/master...](https://github.com/trillek-team/trillek-
computer/blob/master/cpu/TR3200.md#instructions-format)

In a few words, it uses a bit to indicate whether the literal is bigger than
what could normally be stored in a 4-byte instruction. If the bit is set, the
next 4 bytes are the literal value.

------
0x0
Shouldn't gcc (or a similar helper tool) be able to understand the side
effects of the asm instructions and automatically fill in all the clobber
flags, instead of manually having to fill in all those crazy =r style markers?
Is it not possible to code something that determines all affected registers
for a given set of assembly opcodes?

~~~
msbarnett
It's certainly possible to do a lot better than GCC's inline assembly setup,
which has proven time and again to be a usability disaster that positively
encourages writing subtly buggy code.

Microsoft's C compilers do a much better job with inline asm by being more
conservative in how they allocate registers around the asm
([https://msdn.microsoft.com/en-
us/library/k1a8ss06.aspx](https://msdn.microsoft.com/en-
us/library/k1a8ss06.aspx)).

CodeWarrior's PPC compilers remain the ne plus ultra of inline assembly
ergonomics, and it's a goddamn shame that LLVM has chosen (for pragmatic
reasons) to follow GCC's mediocre lead rather than pursue something akin to
it.

~~~
cokernel_hacker
Clang provides both GCC and MSVC style inline assembly.

~~~
msbarnett
Holy crap, I was not aware of that.

edit: this doesn't seem well-documented, if true. The Clang site mostly just
talks about GCC compatibility, eg)
[http://clang.llvm.org/compatibility.html#inline-
asm](http://clang.llvm.org/compatibility.html#inline-asm)

------
nn3
Can anyone clarify his claim on MIPS data hazards? I didn't follow that one.
To my knowledge MIPS has no special hazards, like a VLIW ISA would have. Is
that not correct?

~~~
yuubi
MIPS, the Microprocessor without Interlocked Pipeline Stages, doesn't make
sure the result of one instruction is available before executing an
instruction "later" in the stream that refers to that result. Something like
(pseudo-asm with C-notation comments)

    ld  r1, @r2    ; r1 = *r2
    add r3, r1, r4 ; r3 = r1 + r4

wouldn't set r3=*r2+r4 because the memory access hasn't finished by the time
the add runs.
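The same behaviour in a toy Python model (a one-instruction load delay; nothing like cycle-accurate):

```python
def run(program, regs, mem):
    in_flight = None                     # (dest, value) from the previous ld
    for op, *args in program:
        issued = None
        if op == "ld":                   # ("ld", dest, addr_reg)
            dest, addr_reg = args
            issued = (dest, mem[regs[addr_reg]])
        elif op == "add":                # ("add", dest, src1, src2)
            dest, s1, s2 = args
            regs[dest] = regs[s1] + regs[s2]  # reads possibly-stale values
        # the previous ld's result only lands *after* this instruction ran
        if in_flight:
            regs[in_flight[0]] = in_flight[1]
        in_flight = issued
    if in_flight:                        # drain a trailing ld
        regs[in_flight[0]] = in_flight[1]
    return regs

regs = {"r1": 0, "r2": 100, "r3": 0, "r4": 5}
run([("ld", "r1", "r2"), ("add", "r3", "r1", "r4")], regs, {100: 42})
regs["r3"]  # 5, not 47: the add saw the old r1
```

Inserting a ("nop",) between the two instructions gives the load time to land, and the add then sees 42.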

~~~
aidenn0
And this got really fun when superscalar MIPS processors with instruction
prefetching came out. They had to introduce a different NOP called SSNOP that
stalled all ALUs. Obviously they couldn't just declare "NOP stalls all ALUs",
as that would have had serious performance effects in places where NOPs are
necessary (e.g. branch delay slots).

~~~
protomyth
I'm curious why they still had NOP as part of the name since it actually did
something.

~~~
aidenn0
Well NOP does nothing on one ALU, SSNOP does nothing on all ALUs.

And to give you an example of how you had to calculate things:

There were, if my memory serves me correctly, 6 pipeline stages on the 5k
numbered 0-5, plus instruction prefetch which was numbered -1. You subtracted
the stage in which the instruction took effect from the stage in which a
subsequent instruction needed to see that effect, and the result was the
number of intermediate stages that all ALUs would need to go through.

Worst case scenario would be if you were modifying RAM that would be read as
an instruction; it wouldn't take effect until stage 5, and instruction
prefetch was stage -1, so you needed to make sure all ALUs were busy for 6
clock cycles. In theory you could do the math to figure out the scheduling for
each ALU, but I just dropped 6 SSNOPs in there; since it was a code path that
was only hit during loading of a new process, 6 wasted clock cycles was not a
concern.
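As arithmetic, the rule is just a subtraction (sign convention chosen so the worst case above comes out to 6):

```python
def ssnops_needed(effect_stage, consumer_stage):
    # number of intermediate stages all ALUs must be kept busy for
    return effect_stage - consumer_stage

# store to instruction memory lands at stage 5; prefetch reads at stage -1
ssnops_needed(5, -1)  # 6 -> drop 6 SSNOPs in
```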

Note that this is unrelated to interlocks, as any Modified Harvard
Architecture will require some sort of synchronization when changing the
instruction stream. However, most ISAs have a single instruction that stalls
the pipeline and discards any prefetched instructions (e.g. isync on Power).
They added one in later revisions of the MIPS ISA as well.

Another fun thing was that there was no interrupt-safe way to disable
interrupts, as the interrupt-enabled bit was in a word-sized register along
with other values that could legitimately be changed by an ISR. This was also
fixed by later revisions of the ISA.

------
planteen
"If you do have the misfortune to have to work with SPARC"

So true... I work with the LEON (GPL implementation of SPARC) and have the
occasional very bad day of SPARC asm code.

------
maaku
Sad that there was no mention / evaluation of RISC-V in this post, which
attempts to resolve exactly the problems he identifies...

~~~
microcolonel
I don't think RISC-V really addresses the encoding inefficiency problem,
except for the "C" extension, sorta. Though I don't think that for OoO
superscalar architectures, icache pressure is as much of a problem as it is on
a fancy VLIW.

But yeah, would be nice to get a take on RISC-V in context of this rant.

~~~
_chris_
Huh?

RISC-V with the compressed extension is incredibly efficient in its encoding.
Better than x86 or ARM in both static and dynamic bytes per program.

Also, Icache pressure is a _huge_ problem in modern warehouse-scale computers.

Any processor that cares about performance will almost certainly be
implementing the C extension to RISC-V. It also enables more efficient macro-
op fusion, turning common two instruction 4-byte idioms into a single, more
powerful instruction.

~~~
microcolonel
Thanks for going into more detail. I was basing my assumption that it wasn't a
huge problem on the fact that the only people who complain about it _first_
seem to be the folks designing the Mill. They have a ridiculous/insane/cool
solution to it.

Everyone else seems to first mention their cool branch predictor, or vector
processor.

------
TazeTSchnitzel
I wonder what the author would think of the Mill.

------
DonHopkins
That's "codesign" as in "co-design" not "code-sign".

