
The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat - ingve
http://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-130.html
======
ChuckMcM
It _always_ starts that way: we don't need a special instruction for X because
we're so much faster and more cost effective. And so far, that reasoning has
always fallen to the need for "just one or two" special instructions to make
whatever the killer algorithm of the day is go a bit faster or smoother.

The thing is, since transistors are so cheap and plentiful these days, you
either spend them on bigger and bigger on-chip caches or you use them for some
special-purpose function (like a fancy instruction or maybe a simple
co-processor).

RISC was compelling when it was hard to make transistors, and simple has
always been compelling in the quest for correctness, but simple != RISC.

~~~
_chris_
The title is cutting out an important part of my report's story: "Avoiding ISA
Bloat with _Macro-Op Fusion_".

You can still specialize your processor to eke out more performance, but you
(often) don't need to change the ISA to do it!

Why add load-pair instructions (like ARMv8 did) when you can simulate the
exact same effects using macro-op fusion? And code size doesn't matter, since
RISC-V's compressed ISA extension makes RISC-V the densest ISA out there
without throwing away performance to get it.
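The decoder-level trick being described can be sketched as a peephole pass over the instruction stream. This toy Python model (the tuple encoding and opcode names are invented for illustration, not RISC-V's actual instruction formats) fuses two adjacent loads off the same base register into a single load-pair µop, giving the effect of ARMv8's `ldp` without a dedicated ISA instruction:

```python
# Toy macro-op fusion pass. Instructions are (opcode, dest, base, offset)
# tuples. Two adjacent 64-bit loads from the same base register at
# contiguous 8-byte offsets fuse into one "load-pair" micro-op.
def fuse_load_pairs(instructions):
    fused = []
    i = 0
    while i < len(instructions):
        a = instructions[i]
        b = instructions[i + 1] if i + 1 < len(instructions) else None
        if (b is not None and a[0] == "ld" and b[0] == "ld"
                and a[2] == b[2]          # same base register
                and b[3] == a[3] + 8      # contiguous doublewords
                and a[1] != b[2]):        # first dest must not clobber the base
            fused.append(("ld_pair", (a[1], b[1]), a[2], a[3]))
            i += 2
        else:
            fused.append(a)
            i += 1
    return fused

stream = [("ld", "a0", "sp", 0), ("ld", "a1", "sp", 8),
          ("add", "a2", "a0", "a1")]
print(fuse_load_pairs(stream))
# the two loads become one load-pair micro-op; the add is untouched
```

Real fusion hardware pattern-matches in the decoder rather than over a list, of course, but the adjacency and register constraints are of the same flavor.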

~~~
ChuckMcM
Let me start by saying that I really like the RISC-V architecture: it's
compact, it is easy to understand, and the fact that it is "open" in the
licensing sense is even more awesome. But I also think you've stepped into
what I think of as the "known requirements" fallacy.

Fundamentally the fallacy is this: when you are looking backwards in time at
all of the accommodations existing systems made to adapt to changing
requirements, you can see a path that those systems missed. It is similar
perhaps to standing on high ground overlooking a valley and seeing a path
that people on the ground do not see. The fallacy is believing that somehow
you are a better pathfinder than they are and that you would not make their
choices in their position.

Whether it was the saturating adds and multiplies that came with the DSP
requirements, or the SIMD extensions that came with graphics, these systems
live in an evolving ecosystem of computation which challenges their
architectural invariants again and again. And commercially at least this
pressure to adapt has so far had a perfect record of overwhelming the core
architectural tenets and producing a small wart which later becomes a larger
wart and then a system that, when you look back on it, could have been built
differently had they known the requirement that was going to be thrown at it.

There will always be a case to throw out the currently dominant ISA and
replace it with one without the warts of the existing system. And the cost of
that will be high because it means a very large software migration burden. And
it is that cost which allows the warts to exist in the first place.

I think you have done some good work here. I particularly like the
cost/benefit analysis of adding instructions vs. macro-op fusion. I believe
that will be a useful tool for doing architecture analysis going forward. But
I also found myself strongly rejecting the notion that that particular
implementation of "macro-op fusion" was sufficient to inoculate your ISA from
any future changes (which is implied by the abstract and endorsed by the
conclusion).

~~~
_chris_
> The fallacy is believing that somehow you are a better pathfinder than they
> are and you would not make their choices in their position.

RISC-V has two huge advantages going for it: 1) its designers got to learn
from 40 years of mistakes, and 2) they recognized how important it is for the
ISA to be agnostic to the micro-architecture (I'm amazed at how often I can
tell you details such as the number and types of ports on a processor's
register file just from reading its ISA manual!). I don't think the RISC-V
authors
pretend they would not have made the same mistakes back then, but rather, they
are annoyed that the same mistakes keep being made today! ARMv8 is a newer ISA
than RISC-V and yet I believe it violates #2 (why are their branches 8
bytes?), and the same SIMD design mistakes are still being made over and over
again.

> I also found myself strongly rejecting the notion that that particular
> implementation of "macro-op fusion" was sufficient to inoculate your ISA
> from any future changes. (which is implied by the abstract and endorsed by
> the conclusion)

I don't mean to imply that there are no new instructions that will ever need
to be added! I think popcount is an example that is very painful to handle in
software, and the idiom in assembly is probably too big to handle via macro-op
fusion. I'm also very excited to eventually have a vector ISA in RISC-V.
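For reference, the software fallback alluded to here is the classic SWAR bit-counting sequence. Written out (in Python, with the 64-bit masking that C gets implicitly), it is clearly too long an idiom for a fusion unit to pattern-match:

```python
# Classic SWAR population count for a 64-bit value: roughly a dozen
# arithmetic/logic operations where a hardware popcount would be one.
def popcount64(x):
    x &= 0xFFFFFFFFFFFFFFFF
    x = x - ((x >> 1) & 0x5555555555555555)                          # 2-bit sums
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)   # 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F                          # 8-bit sums
    # multiply gathers all byte counts into the top byte
    return ((x * 0x0101010101010101) & 0xFFFFFFFFFFFFFFFF) >> 56

print(popcount64(0b101101))              # → 4
print(popcount64(0xFFFFFFFFFFFFFFFF))    # → 64
```

A fusion window of two or three adjacent instructions has no hope against a sequence like this, which is the argument for popcount being a legitimate ISA addition.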

Rather, my abstract is fighting against a whole slew of instructions that
exist in current ISAs (or that people want to add to RISC-V) that don't need
to be there. The load-pair/load-increment types that you see in ARMv8 come to
mind.

~~~
ChuckMcM
Fair enough; have you asked the ARMv8 architect [1] why he added them? He
mentioned at ARM TechCon that a lot of thought went into each change they
made.

[1] Richard Grisenthwaite, Lead Architect and Fellow. ARM --
[https://www.arm.com/files/downloads/ARMv8_Architecture.pdf](https://www.arm.com/files/downloads/ARMv8_Architecture.pdf)

~~~
_chris_
I would love to chat with the ARM guys and pick their brains. I suspect for a
few things (like unfused branches) they wanted to match the existing micro-ops
they were already generating since they would still have to support ARMv7!

But I'd also like to know why they didn't have things like AMOs (and instead
had to add them in v8.1). And lastly, I would like to know why there is no
Thumb for AArch64, and why it isn't the default ARMv8 behavior!

------
_chris_
Oh man, that was fast. I didn't tell anybody I had released this yet, and here
it is!

Uhhh, AMA?

~~~
nkurz
Thanks, I've skimmed through it and have a few questions and comments. I know
very little about RISC-V, but I've recently been diving into issues of macro-
and micro-fusion on recent Intel x64.

The first question would be: Why do we care about instruction counts?
Historically it's been used as a proxy for computational cost, but as you
point out it no longer correlates that well. Additionally, many modern
processors have µop caches that greatly reduce the importance of the initial
instructions. But instead of suggesting that we ignore instruction count, the
paper seems to suggest that we need to consider both instruction and µop
count. Why?

You mention that "Compiler tool chains are a continual work-in-progress" and
mention that you "used GCC for all targets as it is widely used and the only
compiler available for all systems". This makes me wonder how much you are
testing the differences in ISAs versus how much you are just testing GCC's
ability to optimize for each.

My strong impression is that GCC does a poor job of optimizing macro- or
micro-fusion for x64. It seems mostly oblivious to the issues, and the
efficiency of the resulting code is equally likely to be anywhere from the
same speed as microarchitecture-conscious assembly to 2x slower. Did you
attempt to correct for this when choosing x64 as the baseline?

For example, I noticed the x64 code in 401.bzip in Code 2:

    
    
      n = ((Int32)block[ptr[unHi]+d]) - med;
    
      4039d0: mov (%r10), %edx
      4039d3: lea (%r15,%rdx,1), %eax
      4039d7: movzbl (%r14,%rax,1), %eax
      4039dc: sub %r9d, %eax
      4039df: cmp $0x0, %eax 
      4039e2: jne 403a8a
    

While some of this is due to the requirements of the specific integer widths
(intentionally? arbitrarily? accidentally?) specified in the source, GCC seems
to have come up with a very roundabout solution. Why a MOV/LEA rather than a
micro-fused ADD load-op? For that matter, why use an LEA (which is restricted
to fewer execution ports) rather than an ADD?

And while in this case it happened to keep the CMP/JNE fusible, why does it
have a CMP there at all, instead of using the flags already set by the SUB?
From Sandy Bridge onward, SUB/JNE is also a macro-fused single instruction, so
instead of two instructions fusing to one we have three instructions fusing to
two.

There are probably reasons for all this, but it does make it awkward as a
baseline. Whether it makes it into the paper or not, verifying that the
baseline x64 numbers don't change significantly with Clang or ICC under a
variety of optimization flags might be useful. If you were heroic, comparing
to a hand-optimized version would be interesting as well.

A last comment is that I recently realized there were some significant
differences between the Haswell and Skylake generations with regard to the
effects of fusion on performance. A loop like this:

    
    
      #define ASM_MICRO_MACRO(in1, sum1, in2, sum2, max)          \
        __asm volatile ("1:\n"                                  \
                        "add (%[IN1]), %[SUM1]\n"               \
                        "cmp %[MAX], %[SUM1]\n"                 \
                        "jae 2f\n"                              \
                        "add (%[IN2]), %[SUM2]\n"               \
                        "cmp %[MAX], %[SUM2]\n"                 \
                        "jb 1b\n"                               \
                        "2:" :                                  \
                        [SUM1] "+&r" (sum1),                    \
                        [SUM2] "+&r" (sum2) :                   \
                        [IN1] "r" (in1),                        \
                        [IN2] "r" (in2),                        \
                        [MAX] "r" (max))
    

Ignoring that this isn't a particularly useful loop, it executes in a single
cycle per loop iteration on Skylake, but takes 1.8 cycles per iteration on
Haswell. The amount of fusion is the same on each, but the performance
differs greatly. I'm not yet sure of the specific reasons, but it makes me
wonder if the choice of Ivy Bridge as the reference implementation for x64
might influence your conclusions (although I don't know in which direction).


~~~
_chris_
> But instead of suggesting that we ignore instruction count, the paper seems
> to suggest that we need to consider both instruction and µop count. Why?

In part because I have nothing else to go on! Not every processor gives me a
µop count. Instead I'm saying "let's look at instruction count, but with a
glass full of salt." That may be fairly discomfiting, but it's what we have.

But at the end of the day, I care about making fast RISC-V processors, and
this level of analysis was enough to give me the insight I wanted: a) the gcc
code generation isn't too bad for RISC-V, and b) hey, there are some neat
tricks we can use to make fast RISC-V processors once we see what other ISAs
thought was worth spending their opcode space on.

> why does it have a CMP there at all, instead of using the flags already set
> by the SUB?

I was quite surprised by how little x86 and ARMv7 made use of condition
codes. Perhaps it's telling that ARMv8 removed most conditional execution
from its ISA...

> My strong impression is that GCC does a poor job of ...

That may be (and I was certainly horrified by how badly it does on libquantum
which is perfectly vectorizable). However, I've been told (for at least many
SPEC benchmarks) that gcc does about as well as icc, and well, it's the free
and available compiler that virtually everybody uses, so we're stuck with it!

And the ISA has been around for 30+ years, so if it's still bad, is it the
compiler's fault for not being smarter, or the ISA's fault for being too
complex for the compiler to get it right?

> If you were heroic, comparing to a hand-optimized version would be
> interesting as well.

There are a million things I'd love to have done (more benchmarks, more ISAs,
more tool-chains...), but this isn't my PhD, and I had to draw the line
somewhere! But if somebody wants a low-hanging Master's Thesis...

> it makes me wonder if the choice of Ivy Bridge as the reference
> implementation for x64 might influence your conclusions

I'd love to read more about that. In my case, I use what I have on hand, which
is a very expensive Xeon Ivy Bridge!

~~~
nkurz
Great answers. You're doing excellent work!

------
panic
Agner Fog's ForwardCom architecture has some interesting ideas around
extensibility:
[https://github.com/ForwardCom/manual/blob/master/forwardcom....](https://github.com/ForwardCom/manual/blob/master/forwardcom.pdf)

------
unsignedqword
Is there even any point in referring to an ISA as distinctly 'RISC' or 'CISC'
anymore? Modern AMD64 'CISC' machines (and even some ARM machines, despite
being considered 'RISC') are held together behind the scenes by microcode
with a much smaller arsenal of instructions, and then you have stuff like
this, where a RISC-V chip might fuse ops together in a very CISC-like
abstraction.

~~~
_chris_
> Is there even any point in referring to an ISA as distinctly 'RISC' or
> 'CISC' anymore?

First, you need to be careful to not confuse the ISA with the micro-
architecture. RISC and CISC are descriptions of the interface that the SW
sees, and the fact that Intel processor pipelines use RISC-like micro-ops
doesn't change the fact that the x86 ISA is very CISC-y (just look at the
picture of all the x86 registers). Of course, the fact that we can so readily
decouple the ISA from the implementation does help us lessen the sins
committed in bad ISA designs, which is where our RISC arguments come back
into play: if we can make all ISAs execute at about the same performance, why
not go with the simplest ISA to implement?

Of course, although I'm being provocative in the report, in reality, RISC and
CISC are not binary terms - they're points on a continuum.

For example, RISC-V's RV64I (integer-only) is incredibly RISC-y (not even a
multiply!), but once you add in double-precision FMAs ("D") and atomic memory
operations ("A") and variable-length instructions ("C")... well, it's not CISC
like VAX, but it's certainly not pure RISC!

~~~
Taniwha
The thing is, the base x86 architecture _isn't_ very CISCy: there are only
simple addressing modes (no double-indirect modes), no addressing side
effects to undo on exceptions (auto-increments), and only one TLB dip per
instruction (the push instruction is really the only exception).

Compared to its compatriots (68k/VAX/NS-whatever etc.) it was positively
RISCy: a VAX instruction could take something like 29 TLB misses and make 29
memory accesses, and a 68020 threw a horrible mess of microstate spew on a
TLB miss so it could make progress (which made implementing Unix signals a
nightmare: where do you safely put that stuff?).

I think its RISCyness is why it's still with us

------
kstenerud
tl;dr

RISC-V with the C extension is a bit denser than x86-64 with vector
operations disabled, and can execute common operations in slightly fewer
cycles thanks to macro-op fusion (especially in compare-branch idioms).

So until they add vector extensions to RISC-V, x86-64, though bloated, is
still king.

~~~
_chris_
I compiled the x86 code with gcc 5.3 using -march/-mtune=native and -O3. It
just turns out that gcc generates entirely scalar code for SPECint!

~~~
astrange
That seems wrong - I'm certain the autovectorizer should fire on SPECint,
since compiler optimizations are pretty much invented just to improve the
SPECint score. gcc's autovectorizer was originally written for IBM's POWER
processors, which are of course not x86.

Maybe it didn't detect '-march=native' properly?

------
stephencanon
The problem is that macro-op fusion, at least as commonly implemented,
requires instructions to be adjacent, or at least close, for fusion to happen.
This means that your careful plan of "not designing the µarch into the ISA"
effectively goes out the window. It's not literally in the ISA, but the
compiler needs to order the instructions _just so_ to benefit from the fusion
opportunities of the µarch that it's targeting, which often penalizes low-
power designs that don't have the fusion in question.

There are some sequences where it makes an enormous amount of sense (cmp+jcc,
mulhi+mullo, etc.); most sufficiently complex µarchs already do those. RISC-V
isn't magically going to squeeze out a lot more macro-fusion opportunities
that other architectures _aren't_ able to take advantage of.
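The adjacency point can be made concrete with a toy fuser: the same operations yield different µop counts depending on how the compiler schedules them (the opcode names here are illustrative, not any real ISA's):

```python
# Toy decoder that fuses only *adjacent* compare+branch pairs, modeling
# the adjacency constraint of real macro-op fusion hardware.
def uop_count(instructions):
    uops = 0
    i = 0
    while i < len(instructions):
        if (i + 1 < len(instructions)
                and instructions[i] == "cmp"
                and instructions[i + 1] == "jcc"):
            uops += 1   # fused compare-and-branch counts as one micro-op
            i += 2
        else:
            uops += 1
            i += 1
    return uops

good = ["load", "cmp", "jcc"]   # cmp/jcc adjacent: the pair fuses
bad  = ["cmp", "load", "jcc"]   # a load scheduled in between: no fusion
print(uop_count(good), uop_count(bad))  # → 2 3
```

The second schedule costs an extra µop on a core with this fuser, which is exactly how the µarch leaks back into the compiler's instruction-scheduling decisions even though nothing changed in the ISA.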

------
alain94040
In your paper, please fix this glaring typo:

// psuedo-code for a ‘repeat move’ instruction

-> pseudo

~~~
_chris_
Yikes, my spell checker missed that. -.- (shakes fist at OSX).

